IoT messaging systems

| | RabbitMQ | QPID | HiveMQ | Mosquitto | Kafka | Redis |
|---|---|---|---|---|---|---|
| Focus | Broker, enterprise, multi-protocol | Broker, enterprise | Small footprint, unreliable networks | Small footprint, unreliable networks | High throughput | In-memory DB + messaging (text only) |
| Semantics | Queue (standard, priority), pub/sub | Queue (standard, priority, sorted, last value), pub/sub | Pub/sub only | Pub/sub only | Queue only | Queue, pub/sub |
| Routing | Default, direct, fanout, topic, header | Default, direct, fanout, topic, header | Pub/sub pattern | Pub/sub pattern | Missing | Pub/sub pattern |
| Min message | 60-100 bytes | 60-100 bytes | 2 bytes (header only) | 2 bytes (header only) | | 9 bytes (queue), 11 bytes (topic) |
| Header | Rich, JMS-like: ttl, correlation-id, priority etc. | Rich, JMS-like: ttl, correlation-id, priority etc. | Simple flags: type, qos, duplication, retain | Simple flags: type, qos, duplication, retain | Any serializable data | Missing |
| History | Missing | Missing | Retained message | Retained message | Always | Queue as a list (access by index or range) |
| Consumer delivery | Push API and pull API | | Push only | Push only | Pull only | Pub/sub – push, queues – pull |
| Reliability | At least once (producer/consumer acknowledgement); mandatory routing, redelivered flag, dead-letter queue | At least once (producer/consumer acknowledgement); dead-letter queue | QoS per message: 0 – at most once, 1 – at least once, 2 – exactly once | QoS per message: 0 – at most once, 1 – at least once, 2 – exactly once | At least once (send „acks", client offsets) | Reliable queue option |
| Resilience | | | QoS + internal message queuing, heartbeats, client takeover | QoS + internal message queuing, client takeover | Replication + leader election | Pub/sub client output buffers |
| Failover/HA | Master node + synchronous mirroring | Synchronization | Load balancer + replication | HA bridge with replication | Partition leader, sync replication, mirroring | Failover + async replication |
| Persistency | Durable queues and messages with delivery_mode=2 (persistent) | | For QoS 1 and 2 | Durable connections (QoS 1, 2), retained message | Always (rolling log) | Only for queues, not for pub/sub channels: RDB (snapshot), AOF (journal) |
| Authentication | PLAIN, AMQPLAIN, EXTERNAL plugin | Cyrus SASL – PLAIN, CRAM-MD5, DIGEST-MD5, GSSAPI (Kerberos) | Pluggable – username/password, certificate | Pluggable – username/password, PSK, SSL/TLS certificate | SASL (Kerberos, plain, hash) | Username/password |
| Authorisation | Pluggable, regexp ACL | ACL | Pluggable – ACL + RBAC | Pluggable – ACL on topic | Pluggable | Missing (security proxy) |
| Encryption | SSL/TLS | SSL | TLS | SSL/TLS | SSL | Missing (SSL via proxy) |
| Special features | | | Last will, retained messages | Last will, retained messages | Message history + manual offset control | FIFO, LIFO, circular queue, manual queue control (insert, trim, len etc.) |

(Empty cells reflect gaps in the original comparison.)
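The 2-byte minimum for MQTT in the table can be illustrated with a short sketch of the MQTT 3.1.1 fixed header (the helper name is mine; the encoding follows the spec's "remaining length" scheme of 7 payload bits per byte with a continuation flag):

```python
def encode_remaining_length(n: int) -> bytes:
    # MQTT "remaining length": 7 bits per byte, MSB = continuation flag
    out = bytearray()
    while True:
        byte, n = n % 128, n // 128
        if n > 0:
            byte |= 0x80
        out.append(byte)
        if n == 0:
            return bytes(out)

# PINGREQ: packet type 12 in the high nibble (0xC0) + remaining length 0
# -> a complete 2-byte MQTT control packet, the minimum from the table
pingreq = bytes([0xC0]) + encode_remaining_length(0)
assert pingreq == b"\xc0\x00" and len(pingreq) == 2
```

This header-only minimum is why MQTT brokers like HiveMQ and Mosquitto suit constrained devices, whereas AMQP frames carry 60-100 bytes of framing and properties.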
Published in Allgemein, BigData, Messaging

Hadoop stream processing

| | Apache Storm | Spark | Apache Samza | Apache Flink | Apache Apex |
|---|---|---|---|---|---|
| Developer | Hortonworks (Twitter) | Databricks | LinkedIn | dataArtisans | DataTorrent |
| Computation model | Storm – streaming; Trident – micro-batching | Micro-batching | Streaming | Streaming or batching | Streaming with time boundaries |
| API | Storm – programmatic; Trident – declarative | Declarative | Programmatic | Declarative | Declarative |
| Resource management | Nimbus + Storm Supervisor | YARN, Mesos, standalone | YARN | YARN, standalone | YARN |
| Parallelism | OS process per task | Configurable | Multiple tasks in one thread | | DAG optimizer, configurable |
| Language support | Java, Python | Java, Scala, Python | Java, Scala | Java, Scala | Java, Scala |
| Internal messaging | Netty | Netty + Akka | Buffer to disk | ? | Intra-process or disk buffer |
| Back-pressure | Limiting max messages in topology at any time | + | Relies on Kafka | Through buffer handover | Automatic, buffering |
| Interactive processing | Over DRPC | From Spark 2.0 | Missing | Missing | Missing |
| Design | Topology | Spark job (DAG) | Samza jobs | Flink job (DAG) | |
| Message source/target | Spout | Receivers | Stream consumer | Source | Input/output operator |
| Message processing | Bolt | Transformations + operations | Task | Operator => Task | Compute operator |
| Stream primitives | Tuple | DStream, RDD | Message | DataStream | Tuple |
| Kafka integration | Kafka spout | Receiver approach + DirectStream | Tight | Kafka consumer | Kafka input |
| At most once | Storm + auto ACK; in-order processing | Receiver without checkpointing; in-order processing | Not needed | Not used | Stateless operators; in-order processing |
| At least once | Storm + ACK on success; out of order | Receiver with checkpointing; in-order processing | In-order processing (within partition) | Out-of-order processing | Out-of-order processing |
| Exactly once | Only in Trident mode; in-order processing | With WAL enabled; in-order processing | Missing | In-order processing | In-order processing |
| Kafka offset | ZooKeeper | WAL log | ZooKeeper | ZooKeeper | ZooKeeper or OffsetManager |
| Manipulating checkpoints | Possible | Not possible | Utility | Possible | Possible |
| State management | Only in Trident mode | WAL logs; disabled per default | Replicated changelog | Distributed snapshotting; disabled per default | Operator snapshot; enabled per default, suppressible |
| State storage | Pluggable API (JDBC, file system, etc.) | File system (HDFS, S3) | Pluggable API; LevelDB key/value | (HDFS, in-memory) | |
| State behavior | Cached state batches | At end of the micro-batch | Task-local | Time-periodical (Chandy-Lamport) | At window boundary |
| Join streams | Code or Trident API | + | Code | + | + |
| Merge streams | + | + | + | + | + |
| Partitioning | By grouping | + | Code | + | + |
| Time-based windows | Storm – TickTuple; Trident – batch | + | Yes | + | + |
| Count-based windows | Code or Trident API | Hard | Code | + | Possible |
| Sliding windows | Code or Trident API | Time | Code | Time- and count-based | Time |
| Latency | Storm – microseconds; Trident – subsecond | Subsecond | Milliseconds (disk buffering) | Milliseconds | Milliseconds |
| Throughput | Medium | High | Medium to high | High | High |
| Impact of fault | Medium to high | High | Low | High | Low |
| Scalability | Horizontal (ACK limit), rebalancing | Scheduler limit | Horizontal | Horizontal | Horizontal, dynamic |

(Empty cells reflect gaps in the original comparison.)



Storm notes:

  • There is a storm-yarn wrapper implementation from Yahoo which allows starting a Storm cluster in YARN containers, but this is not the default execution scenario.
  • A bolt can maintain in-memory state (which is lost if that bolt dies), or it can make calls to a remote database to read and write state. However, a topology can usually process messages at a much higher rate than calls to a remote storage can be made, so making a remote call for each message quickly becomes a bottleneck.
  • An upstream bolt/spout failure triggers a re-compute of the entire tree. Failed tuples cannot be replayed quickly. Storm meets latency and throughput requirements only when no failure occurs.
  • Trident provides a high-level API for data manipulation such as filtering, joins and aggregation, but implements it using standard Storm building blocks; the same logic could be implemented manually.
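The remote-state bottleneck in the second note is easy to quantify. The latencies below are illustrative assumptions, not measurements; the point is only that one synchronous round trip per tuple caps a bolt's throughput at the reciprocal of the call latency:

```python
def max_throughput(remote_call_us: float) -> float:
    # messages/second when every message waits for one synchronous call
    # of the given latency (in microseconds)
    return 1_000_000 / remote_call_us

remote_db = max_throughput(1000)   # ~1 ms round trip to a remote store
in_memory = max_throughput(1)      # ~1 µs in-process state lookup

assert remote_db == 1_000.0        # only ~1,000 msg/s per bolt instance
assert in_memory == 1_000_000.0    # in-memory state keeps up with the topology
```

This is exactly the gap that Samza's co-located state (below) and Trident's batched state updates are designed to close.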


Spark notes:

  • The same code can be used for batch and stream processing.
  • Many tools and integration possibilities with other systems: GraphX, MLlib, SQL, Atlas Notebooks.
  • Spark stores its internal state in a proprietary format (apparently a serialized Java object graph), which makes upgrading a Spark application more difficult (a graceful shutdown with empty state is needed).
  • It is possible to set „maxRate" or „maxRatePerPartition" for Kafka streams, limiting the maximum number of messages processed per micro-batch.
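The effect of „maxRatePerPartition" from the last note can be sketched as simple arithmetic (assuming the setting means records per partition per second, applied over one batch interval; the numbers are made up):

```python
def batch_ceiling(max_rate_per_partition: int, partitions: int,
                  batch_interval_s: float) -> int:
    # upper bound on records a single micro-batch may contain
    return int(max_rate_per_partition * partitions * batch_interval_s)

# e.g. 1,000 rec/s per partition, 8 Kafka partitions, 2 s batch interval:
assert batch_ceiling(1000, 8, 2.0) == 16_000
```

Capping the batch size this way trades ingestion lag for a bounded per-batch processing time, which is the main defense a micro-batch system has against the backlog effect described in the general notes further down.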


Samza notes:

  • By co-locating storage and processing on the same machine, Samza is able to achieve very high throughput, even when there is a large amount of state.
  • Every input and output stream in Samza is fault-tolerant and replicated (buffering of messages between StreamTasks), which makes „at most once" processing obsolete – messages will always be processed.
  • Single-threaded parallelism (a container executes only one thread but may contain multiple tasks) maps one thread to exactly one CPU, which provides a more predictable resource model and reduces interference from other tasks running on the same machine.
  • Different Samza jobs can communicate with each other over named streams, making it more of an EAI-like engine.
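Samza's single-threaded parallelism model can be sketched as a toy task assignment (hypothetical numbers; in Samza the task count is derived from the input stream's partitions, and each container runs its tasks on one thread):

```python
def assign_tasks(partitions: int, containers: int) -> dict:
    # one task per input partition, distributed round-robin over
    # single-threaded containers
    assignment = {c: [] for c in range(containers)}
    for task in range(partitions):
        assignment[task % containers].append(task)
    return assignment

layout = assign_tasks(partitions=8, containers=4)
# 4 single-threaded containers with 2 tasks each -> at most 4 CPUs busy,
# which is the predictable one-thread-per-CPU model described above
assert all(len(tasks) == 2 for tasks in layout.values())
```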


Flink notes:

  • Broader toolset than Storm or Apex – ML, batch processing, SQL-like queries.
  • Failures reset all operators back to the source; throughput and latency requirements can be met only if no errors occur.


Apex notes:

  • Outstanding latency and throughput, even in failure scenarios.
  • Apex is „Spark on steroids", using all the advantages of micro-batching (executing sequences of batches/windows in a semi-parallel fashion) while avoiding Spark's drawbacks (no scheduling, no shuffling and serialization/deserialization in THREAD_LOCAL, CONTAINER_LOCAL or NODE_LOCAL topologies).

General notes:

  • Micro-batching limits the latency of jobs to that of the micro-batch. While sub-second batch latency is feasible for simple applications, applications with multiple network shuffles easily bring the latency up to multiple seconds.
  • Micro-batch architectures that use time-based batches have an inherent problem with backpressure effects. If processing of a micro-batch takes longer in downstream operations (e.g., due to a compute-intensive operator, or a slow sink) than in the batching operator (typically the source), the micro batch will take longer than configured. This leads either to more and more batches queueing up, or to a growing micro-batch size.
  • All well-designed streaming systems (as opposed to micro-batching) do not process and ship one record at a time; they buffer many records before shipping them over the network (using transport libraries like Netty), while keeping continuous operators.
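The queueing effect in the second general note can be simulated in a few lines. This is a toy model with a fixed batch interval and a constant processing time per batch; real systems jitter, but the stability condition is the same:

```python
def backlog_after(batches: int, interval_s: float, processing_s: float) -> float:
    # seconds of unprocessed work queued up after N batch intervals:
    # whenever a batch takes longer to process than the interval that
    # produced it, the excess accumulates linearly
    return max(0.0, (processing_s - interval_s) * batches)

# stable: 2 s batches processed in 1.5 s -> no backlog, ever
assert backlog_after(100, 2.0, 1.5) == 0.0
# unstable: a slow sink needs 2.5 s per 2 s batch -> backlog grows without bound
assert backlog_after(100, 2.0, 2.5) == 50.0
```

The only escape hatches are dropping the input rate (Spark's „maxRate" family) or letting the batch size grow, which is exactly the dilemma the note describes.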

Apache streaming comparison
Benchmarking Streaming Computation Engines at Yahoo!
The world beyond batch: Streaming 101
The world beyond batch: Streaming 102
Apache Beam


Hadoop file format comparison

Use case and environment

IoT datalake use case.
6000 devices (with unique IDs), each measuring 3 values 60 times per second (60 Hz).
One day of data (24 hours) – 31.104.000.000 records in the database.
Row in a table – [ID:int, timestamp:long, value1:float, value2:float, value3:long]
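The record counts implied by this setup follow from simple arithmetic, and they match the "records scanned" figures in the benchmark tables below:

```python
DEVICES = 6000
HZ = 60                         # measurements per second per device
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

rows_per_device_per_day = HZ * SECONDS_PER_DAY   # 5,184,000
rows_per_day = DEVICES * rows_per_device_per_day

# selecting one day of data for 3 device IDs touches this many rows
rows_for_3_ids = 3 * rows_per_device_per_day

print(rows_per_day)    # 31104000000
print(rows_for_3_ids)  # 15552000
```

The 15.552.000 figure is exactly the "records scanned" value of the well-partitioned scenarios (Nr 17/18) in the tests below.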

Hortonworks 2.4 Hadoop cluster – 16 DataNodes with 24 VCore each, 256 GB Memory
Hive 0.13, Spark 1.6, Tez


Find optimal table layout and file format to optimize:
1) Compression
2) SQL latency for data selection (a few IDs in one select).

Layout scenarios

Assumption – partitioning the table by ID (one subfolder for every ID value) or bucketing (same ID saved in one file, number of files = number of buckets) allows faster selection of data for a particular ID.
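The layouts being compared can be sketched as simple functions of the ID (a toy model: Hive's real bucketing hashes the column value and integer IDs are assumed here; the function names are mine):

```python
def partition_dir(device_id: int) -> str:
    # partition by ID: one subfolder per distinct ID value (6000 folders)
    return f"id={device_id}"

def modulo_partition_dir(device_id: int, parts: int = 100) -> str:
    # "modulo partitioning": mod(ID, 100) -> 100 folders, 60 IDs per folder
    return f"part={device_id % parts}"

def bucket_file(device_id: int, buckets: int = 16) -> int:
    # bucketing: the same ID always hashes into the same file,
    # number of files = number of buckets (Hive uses hash(col) % buckets)
    return hash(device_id) % buckets

assert partition_dir(4711) == "id=4711"
assert modulo_partition_dir(4711) == "part=11"
assert bucket_file(4711) == bucket_file(4711)   # stable per ID
```

Either way, a select for a handful of IDs only has to open the folders/files that can contain those IDs, which is the effect measured below.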

| Nr | File type | Partitioning/bucketing | Compression | Options | Table size (GB) | Files |
|---|---|---|---|---|---|---|
| 1 | ORC | | ZLIB | 256MB stripe, 10.000 stride | 176,7 | 210 |
| 2 | ORC | | ZLIB | 256MB stripe, 100.000 stride | 176,5 | 209 |
| 3 | ORC | | SNAPPY | 256MB stripe, 100.000 stride | 254,2 | 285 |
| 4 | PARQUET | 16 buckets, sorted | SNAPPY | | 346,5 | 16 |
| 5 | PARQUET | 32 buckets, sorted | SNAPPY | | 346,5 | 32 |
| 6 | PARQUET | 64 buckets, sorted | SNAPPY | | 346,5 | 64 |
| 7 | ORC | 16 buckets, sorted | ZLIB | 64MB stripe, 10.000 stride, bloom filter | 171,6 | 16 |
| 8 | ORC | 32 buckets, sorted | ZLIB | 64MB stripe, 10.000 stride, bloom filter | 171,6 | 32 |
| 9 | ORC | 64 buckets, sorted | ZLIB | 64MB stripe, 10.000 stride, bloom filter | 171,6 | 64 |
| 10 | ORC | 16 buckets, sorted | ZLIB | 256MB stripe, 100.000 stride, bloom filter | 171,5 | 16 |
| 11 | SequenceFile | | SNAPPY | 256MB block, BLOCK compression | 529,6 | 601 |
| 12 | SequenceFile | | BZIP2 | 256MB block, BLOCK compression | 239,5 | 268 |
| 13 | SequenceFile | | Deflate | 256MB block, BLOCK compression | 300,2 | 332 |
| 14 | Avro | | SNAPPY | 256MB block, BLOCK compression | 481,8 | 545 |
| 15 | Avro | | BZIP2 | 256MB block, BLOCK compression | 343,9 | 387 |
| 16 | Avro | | Deflate | 256MB block, BLOCK compression | 358,9 | 403 |
| 17 | ORC | partition by ID | ZLIB | 256MB stripe, 100.000 stride, bloom filter | 208,9 | 6000 |
| 18 | PARQUET | partition by ID | SNAPPY | | 346,6 | 6000 |
| 19 | ORC | partition by mod(ID, 100), sorted | ZLIB | 256MB stripe, 100.000 stride, bloom filter | 176,7 | 100 |
| 20 | ORC | partition by mod(ID, 100) + 8 buckets, sorted | ZLIB | 256MB stripe, 100.000 stride, bloom filter | 171,5 | 800 |
| 21 | PARQUET | partition by mod(ID, 100) + 8 buckets, sorted | SNAPPY | | 352 | 800 |
  • SequenceFile could be a good candidate for the partition-by-ID scenario, but BZIP2 is very expensive (4 times more time to generate the test table compared to ORC + ZLIB).
  • ORC+ZLIB files partitioned by ID allow fast selection and have good compression.
  • ORC files partitioned by ID + bucketing could be a good compromise between number of files and selection time.
  • PARQUET+SNAPPY has worse compression than ORC+ZLIB. PARQUET currently does not support ZLIB or BZIP2 compression.
  • Avro compression is worse than that of ORC or PARQUET.

Data analysis

Test – calculate the daily average for 3 IDs: select avg(value1) from table where id in (id-1, id-2, id-3)

Each select was executed 3 times for the Tez and Spark engines (MapReduce was too slow, Impala is not available on Hortonworks). The average time, number of workers and scanned records were constant across executions. Slow scenarios (e.g. SequenceFile and Avro) were not executed on Spark (Tez only).

| Nr | Tez time (sec) | Tez tasks | Tez records scanned | Spark (64 worker) time (sec) | Spark tasks | Spark records scanned |
|---|---|---|---|---|---|---|
| 1 | 91 | 210 | 36.134.889 | 127 | 668 | |
| 2 | 92 | 209 | 219.484.889 | 140 | 668 | |
| 3 | 81 | 285 | 35.979.889 | 161 | 1003 | |
| 4 | 84 | 398 | 539.167.342 | 82 | 2784 | |
| 5 | 88 | 397 | 539.167.342 | 82 | 2784 | |
| 6 | 93 | 392 | 2.134.662.574 | 79 | 2800 | |
| 7 | 31 | 48 | 15.580.000 | 106 | 688 | |
| 8 | 30 | 96 | 15.590.000 | 106 | 704 | |
| 9 | 32 | 187 | 15.585.000 | 103 | 704 | |
| 10 | 19 | 48 | 15.580.000 | 184 | 689 | |
| 11 | 471 | 601 | | 402 | 4334 | |
| 12 | 1242 | 268 | | n/a | | |
| 13 | 437 | 332 | | n/a | | |
| 14 | 1512 | 545 | | n/a | | |
| 15 | 3691 | 387 | | n/a | | |
| 16 | 2671 | 403 | | n/a | | |
| 17 | 17 | 3 | 15.552.000 | 6 | 4 | 15.552.000 |
| 18 | Tez error | 12 | | 5 | 4 | 15.552.000 |
| 19 | 56 | 117 | 160.437.634 | 7 | 585 | 933.120.000 |
| 20 | 25 | 24 | 17.800.000 | 8 | 24 | 933.120.000 |
| 21 | Tez error | 113 | | 6 | 97 | 933.120.000 |
  • Only partitioned and bucketed tables provide acceptable latency.
  • Spark does not use ORC „predicate pushdown" (SPARK-11087), but is faster than Tez (at least for small files).
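The ~15.58M "records scanned" of the sorted, bucketed ORC scenarios can be explained with a toy model of ORC's row-index ("stride") elimination: because data is sorted by ID, the reader skips every 10,000-row stride whose min/max ID range cannot contain a requested ID, and only boundary strides add overhead. The model below is an interpretation of the numbers, not how ORC is implemented internally:

```python
STRIDE = 10_000
ROWS_PER_ID = 5_184_000   # 60 Hz * 86,400 s, one day per device

def strides_scanned(target_ids, rows_per_id=ROWS_PER_ID, stride=STRIDE):
    scanned = 0
    for _ in target_ids:
        # a contiguous run of rows_per_id sorted rows overlaps at most
        # ceil(rows_per_id / stride) + 1 strides (one extra at a boundary)
        scanned += -(-rows_per_id // stride) + 1
    return scanned * stride

rows = strides_scanned([1, 2, 3])
print(rows)  # 15600000
```

15.600.000 is within boundary-rounding distance of the ~15.58M actually reported, i.e. essentially only the 3 requested IDs' rows are read instead of the full 31 billion.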

Data selection

Test – select daily data for 3 IDs: select * from table where id in (id-1, id-2, id-3)

Only „interesting" scenarios were tested. Each select was executed 3 times; the average times are shown in the table.

| Nr | Tez time (sec) | Tez tasks | Tez records scanned | Spark (64 worker) time (sec) | Spark tasks | Spark records scanned |
|---|---|---|---|---|---|---|
| 17 | 27 | 3 | 15.552.000 | 29 | 4 | 15.552.000 |
| 20 | 36 | 24 | 17.800.000 | 32 | 24 | 933.120.000 |


  • The ORC file format with the ZLIB compression algorithm provides the best compression level.
  • The SequenceFile and Avro file formats both showed poor compression levels and selection times.
  • SNAPPY compression allows faster selection.
  • SNAPPY is better supported by Spark than by Tez.
  • Spark does not support Hive bucketing and does not use ORC predicate pushdown.
  • Spark is faster for analytical SQL queries (e.g. AVG).
  • Spark and Tez are comparable for pure data selection (scenario Nr 17); Spark suffers from high costs for job/data serialization and shuffling.
  • For „modulo partitioning" (scenario Nr 20 – not supported by Hive), Spark provides better throughput thanks to better optimization and parallelization for a large number of files.
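The 933.120.000 records Spark scans in the modulo-partitioned scenarios also follow from the layout arithmetic (an interpretation: without predicate pushdown, Spark reads the 3 matched partitions completely, and each mod(ID, 100) partition holds 60 IDs):

```python
TOTAL_IDS = 6000
PARTS = 100                # mod(ID, 100) partitioning
ROWS_PER_ID = 5_184_000    # 60 Hz * 86,400 s

ids_per_partition = TOTAL_IDS // PARTS               # 60 IDs per folder
rows_per_partition = ids_per_partition * ROWS_PER_ID

# 3 distinct IDs land in (at most) 3 partitions, read in full:
rows_read = 3 * rows_per_partition
print(rows_read)  # 933120000
```

So Spark reads roughly 60× more rows than Tez (17.8M) in scenario 20 and still wins on time, which supports the bullet above: its advantage is parallelism over many files, not I/O efficiency.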

Hive vs Spark vs Impala

| | Hive 0.13 | Spark 1.6 | Impala 2.1 |
|---|---|---|---|
| Support | Hortonworks + Yahoo | Databricks + Yahoo | Cloudera |
| Cluster management | YARN | YARN, Mesos, local | YARN (Llama) |
| Engine | MR, Tez | Spark | impalad |
| Where tables are stored | HDFS | HDFS (through Hive Metastore). Distributed shared object space + disk overruns. Special storage implementations (e.g. CassandraSQLContext, HBaseSQLContext) | HDFS (through Hive Metastore) |
| Preferred storage format | ORC | Parquet | Parquet |
| Joins | Mostly on disk (depending on engine) | Memory | Memory |
| Target | Long-running fault-tolerant queries | Ad-hoc queries. Long-running fault-tolerant queries | Ad-hoc queries. No fault tolerance between queries |
| Performance | Depends on query type and engine: Tez – good, MR – mediocre | Good – better than Tez for complex queries, through better optimizations and in-memory processing | Good – slower than Spark due to lack of cache and in-memory data store |
| Latency | Bad – seconds before query execution starts; better on pre-warmed Tez containers | Average – overhead for job planning/distribution and RDD handling (serialization, shuffling etc.) | Good – the impalad daemon executes ad-hoc requests |
| User-defined functions | Supported | Supported | Work in progress |
| Integration of heterogeneous databases | Supported through pluggable „storage handlers" (e.g. any JDBC data source) | Different SQL context objects in the same Spark context (e.g. CassandraSQLContext and HiveSQLContext) | Through Hive + storage handlers |



OWB to ODI migration

Oracle has finally presented an official OWB to ODI migration utility
and an article describing the typical migration path!

Alternative tools and ways for OWB to ODI migration:

  1. OWB2ODI Converter
  2. Export metadata from OWB via OMBScript and generate ODI XML files.

ETL tool comparison

Loading Type

Informatica: ETL, ELT with „Pushdown Optimization Option“
Pentaho: ETL
DataStage: ETL
Microsoft SSIS: ELT

System Platform

OWB: Windows, Unix
ODI: Windows, Unix
Informatica: Windows, Unix
Pentaho: Windows, Unix, Other Platforms (JDK)
DataStage: Windows, Unix, Other Platforms
Microsoft SSIS: Windows


Parallelism

OWB: Parallel steps in OWB flows (mappings or subflows). Parallel SQL.
ODI: Parallel processes in scenarios.
Informatica: Partition Option (parallel threads). Asynchronous task steps in workflows.
Pentaho: Parallel job execution (asynchronous). Parallel mapping execution (e.g. parallel DML).
DataStage: Parallel jobs and parallel stages.
Microsoft SSIS: asynchronous data flows (tasks) in a package.


Clustering and cloud

ODI: Load balancing for distributed agents executing ETL scenarios.
Informatica: Informatica cluster. Public cloud version.
Pentaho: Cluster (carte server).
DataStage: MPP, cluster. AWS cloud.
Microsoft SSIS: SQL Server Cluster. Azure cloud.

ETL Process

OWB: Jobs with processes (PL/SQL) and mappings (generated PL/SQL)
ODI: Scenarios containing processes (knowledge modules)
Pentaho: jobs with transformations
Informatica: sessions with workflows/worklets and tasks/tasklets
DataStage: Server job, DataStage job, Sequence Job (workflow)
Microsoft SSIS: package

Deployment facility

OWB: Yes (Configurations, Export/Import between repositories)
ODI: Yes
Informatica: Sessions with environment specific parameter files.
Pentaho: Export/Import XML files between environments. Environment properties (as file or db content).
DataStage: Information Server Manager – build, deploy. Transfer binaries (same platform). ODF and XML.
Microsoft SSIS: Yes

Engine based / code generator

OWB: PL/SQL packages executed by OWB repository.
ODI: Agents executing generated scenarios (Interpreter).
Informatica: Engine.
Pentaho: Engine interpreting XML jobs and transformations.
DataStage: Binaries (linked libraries).
Microsoft SSIS: SSIS engine executing packages.

Environment configuration

OWB: In repository of target environment.
ODI: ?
Informatica: Environment specific parameter files on session level. Multiple datasources/targets.
Pentaho: Environment properties (as file or db content). Any kind of JDBC connection.
DataStage: Configuration files (APT_CONFIG_FILE)
Microsoft SSIS: „Configurations“ in packages.

Numbers of available transformation functions

OWB: about 30
Informatica: 58
Pentaho: 60
DataStage: 70
Microsoft SSIS: 30

Slowly Changing Dimension

OWB: SCD Mappings (Enterprise ETL Option)
ODI: Yes
Informatica: Supports Full history, recent values, Current & Prev values.
Pentaho: Yes
DataStage: SCD „stage“ for Type 1 and 2
Microsoft SSIS: Yes, with a Wizard but difficult to modify

Rejected Records

OWB: Error tables
ODI: Some (Oracle) „integration knowledge modules" have rejection logic. „Check knowledge modules" are suggested for business-check rejection.
Informatica: Reject (bad) files. Rejected records in logs.
Pentaho: Error handling data connections and validation nodes
DataStage: Yes
Microsoft SSIS: Can use conditional split transformation and error redirection to either CSV files or tables.

Debugging Facility

OWB: basic debugging, breakpoints
ODI: simulation mode
Informatica: session debugger, breakpoints, data movement, data analysis
Pentaho: mapping debugger, breakpoints, data movement, data analysis
DataStage: Supports basic debugging facilities for testing.
Microsoft SSIS: breakpoints, Data Viewer and error handling

Languages (extensibility)

ODI: Java, Jython, Java Bean Shell
Informatica: Java transformations.
Pentaho: Java, JavaScript
DataStage: Script, C++
Microsoft SSIS: C#, VB.NET (2005)

Application Integration Functionality

OWB: Difficult
ODI: JAX-WS, JMS connectors
Informatica: WebServices (XML, JSON), JMS, IBM MQ, Tibco Rendezvous
Pentaho: JMS node, possible (e.g. Java transformation nodes)
DataStage: WISD integration (web services), InfoSphere Streams, IBM MQ stages, Java/C++ API.
Microsoft SSIS: WebService Task, MSMQ integration. Any .NET functionality possible.

CDC and real time ETL

OWB: Not available
ODI: Oracle Golden Gate knowledge modules
Informatica: PowerExchange CDC
Pentaho: possible (e.g. java transformation nodes)
DataStage: IBM InfoSphere CDC. Always-on stages (e.g. ISD, MQClient or custom Java).
Microsoft SSIS: StreamInsight

Metadata: Ability to view & navigate metadata on the web

OWB: OWB Repository Browser
ODI: ODI Metadata Navigator (weblogic)
Informatica: Job sessions can be monitored using Informatica Classes
Pentaho: Pentaho Repository
DataStage: InfoSphere Metadata workbench. DataStage Operations Console for operational view (monitoring, statistics, performance)
Microsoft SSIS: Does not support

Ability to Customize views of metadata for different users

ODI: User roles (repository)
Informatica: Supports.
Pentaho: User rights (read,write,access) on jobs and transformations
DataStage: User roles
Microsoft SSIS: Does not support

Metadata repository can be stored in RDBMS

OWB: Yes (only)
ODI: Yes (only)
Informatica: Yes
Pentaho: Yes (XML, RDBMS, Enterprise Repository)
DataStage: No. But the proprietary metadata can be moved to an RDBMS using the DOC Tool
Microsoft SSIS: Does not support

Metadata API

OWB: OMBScript. Oracle repository tables.
ODI: Java API. DB repository tables.
Pentaho: XML files. DB repository tables (DB repository). JSR-283 (Enterprise repository).
Microsoft SSIS:

Command line operation

ODI: Yes (start, stop, sleep and many others)
Informatica: Yes (pmcmd – start, stop, abort or schedule workflows)
Pentaho: Start/Stop of job and transformations (pan and kitchen)
DataStage: Start/Stop, compile, export/import metadata (many different commands)
Microsoft SSIS: Yes (Dtexec.exe)

Ability to maintain versions of mappings

OWB: Very difficult (possible via export/import with OMBScript on object level).
ODI: Yes. Internal version management on object level.
Informatica: Yes (additional licence needed). SVN through XML export/import possible.
Pentaho: SVN (for XML repositories), or basic locking/versioning on object level in the „Enterprise repository"
DataStage: CVS and ClearCase support.
Microsoft SSIS: Difficult (binary dtsx files).

Job Controlling & Scheduling

OWB: No (external scheduler or Oracle scheduler needed)
ODI: Yes
Informatica: Very good
Pentaho: only for „Enterprise repository“
DataStage: Does not support scheduling directly (no option), but it is possible to call custom programs after the job gets executed.
Microsoft SSIS: Job Agent, Email Notification


Usability

OWB: Average. Slow learning curve. Bugs (out of support). Slow UI.
ODI: Average. Slow learning curve. Own ETL tooling vision.
Pentaho: Very good. Steep learning curve. Less mature than competitors (e.g. weak logging and monitoring). Good UI. No installation needed.
Informatica: Very good. Steep learning curve. Clean and intuitive UI.
Microsoft SSIS: Good. Steep learning curve. Visual Studio as UI. C# knowledge needed.


Costs

OWB: Included in the Oracle DB. Additional options could be necessary (Partitioning, Enterprise ETL, etc.). About 15.000 per core/year.
ODI: Per core or named user. About 30.000 per core/year.
Pentaho: Free. Support costs about 30.000 per CPU/year.
Informatica: Per server. Additional options could be necessary (e.g. „Pushdown Optimization"). Expensive and intransparent.
DataStage: Per processor (Processor Value Unit).
Microsoft SSIS: Part of the SQL Server license (CAL or Core).


IBM Message Broker logging overview

Provided by the IBM Message Broker platform

  • Rudimentary activity log (broker start/stop, deployments etc. – operational perspective)
  • Standard error logging (flow or transaction fails, queue full etc.)
  • Logging in flow (LOG command in ESQL, logging from a Java Compute Node)
  • Trace Node
  • Monitoring events


Additional custom options

  • Log4j
  • File Output Node
  • Logging into a database
  • Logging into an MQ queue

| Logging type | Target | Control | Content |
|---|---|---|---|
| Activity log | Syslog | No | Operational info |
| Standard error log | Syslog | No | Operational info (unhandled exceptions) |
| Logging in flow | Syslog | Output from ESQL (LOG statement) and Java Compute Node (e.g. MbService.logInformation) | Any single line of text |
| Trace node | One of: local error log (syslog), user trace (binary), file | Enable/disable trace | Message from the message flow plus additional predefined messages from the message catalog |
| Monitoring events | MQ topic | Enable/disable events. The only method for logging BEFORE the flow begins (transaction start); the only method for logging transactions | Anything in $Root, $Environment, $LocalEnvironment, $Properties, $DestinationList |
| Log4j | Any log4j destination | Only from Java Compute Node | Any |
| FileOutputNode | Any file path | Dynamic routing possible | Anything in $Root, $Environment, $LocalEnvironment, $Properties, $DestinationList |
| Database logging | Database | From ESQL or Java Compute Node | Any |
| Logging into queue | Queue | MQ Output node needed in the flow; online enable/disable via dynamic routing possible | Message (but any message can be constructed by a preceding node) |

Local Environment (Dynamic Routing)
Message Broker Logs
Message Broker Trace
Message Flow Monitoring


IBM Message Broker automatic deployment (windows)

IBM Message Broker follows the Unix ideology and comes with a set of configuration/administration command-line tools.
In Unix style, these tools are extended and automated with shell scripts.
On Windows, however, this can become a problem.
For example, integrating IBM Message Broker into a CI environment (Jenkins, Hudson, CruiseControl etc.) is such a problem.

On Windows, a BAR file (a Message Broker deployment archive) is deployed with the „mqsideploy.exe" command.
The program expects an environment that is prepared by running „mqsiprofile.cmd".
Executing CMD files on Windows opens a new shell and does not terminate.
As a result, the two successive calls (mqsiprofile, then mqsideploy) cannot be executed sequentially from ANT (ANT is still a very widespread deployment instrument for complex environments).

To work around the problem, you need a small BAT file that „sources" mqsiprofile.cmd via a „call" invocation.

rem %1 - broker name
rem %2 - execution group name
rem %3 - bar file

call mqsiprofile.cmd
mqsideploy.exe %1 -e %2 -a %3 -m

The ANT call then looks as follows:

<exec executable="deploy.bat" spawn="false" resolveexecutable="true" vmlauncher="false">
  <arg value="${broker}" />
  <arg value="${executiongroup}" />
  <arg value="${barfile}" />
</exec>


Oracle ODI interface language

Oracle ODI (like any other good Java application) detects its language from the operating-system environment.

For various reasons (a multinational developer team, poor translations, English documentation or tutorials) you may nevertheless want to set a different IDE language. To change the language in the current ODI version, the corresponding language parameters must be passed to the Java virtual machine.

On startup, the ODI client reads the configuration file [ODI_HOME]\oracledi\client\ide\bin\ide.config
So you only need to add two additional lines at the end of this file, e.g. to set the US locale:

AddVMOption -Duser.language=en
AddVMOption -Duser.country=US

Oracle Data Integrator


Java Hibernate + Oracle. Performance optimization.

In principle, the program can be optimized on 3 different levels:

  1. Data model level
  2. Oracle tuning (indexes, hints, partitioning, hardware and operating-system optimization)
  3. Hibernate tuning (fetch mode, caching, program logic)

Tuning on the data model level is rather rare and is usually only done for particularly stubborn problems; in most cases it amounts to denormalizing the data. Data-model optimizations are often already made during the architecture phase.
Tuning on the Oracle side is very useful and definitely has to be done when performance problems occur.
Hibernate tuning is usually the most flexible, but at the same time the most complex, process.

To optimize effectively, the performance problem must first be localized. This is achieved by generating realistic workloads on the application while logging all Hibernate and database-related events (caching, SQLs, timings).


Hibernate has a built-in mechanism for logging SQL statements to the Java console, which can be enabled with a configuration parameter (Spring example):

<bean id="exampleHibernateProperties" class="org.springframework.beans.factory.config.PropertiesFactoryBean">
  <property name="properties">
    <props>
      <prop key="hibernate.show_sql">true</prop>            <!-- show SQL statements on console -->
      <prop key="hibernate.use_sql_comments">true</prop>    <!-- generate additional comments for every statement -->
      <prop key="hibernate.format_sql">true</prop>          <!-- format SQL statements -->
      <prop key="hibernate.generate_statistics">true</prop> <!-- generate statistics for SQL statements -->
    </props>
  </property>
</bean>

An alternative way is to log SQL statements with the log4j framework.

# Log all SQL statements
<logger name="org.hibernate.SQL">
    <level value="DEBUG"/>
</logger>
# Log SQL statements for id generation
<logger name="">
    <level value="INFO"/>
</logger>
# Log the JDBC parameters which are passed to a query (very verbose)
<logger name="org.hibernate.type">
    <level value="DEBUG"/>
</logger>
# Log cache-related activities
<logger name="org.hibernate.cache">
    <level value="DEBUG"/>
</logger>

The next alternative would be special JDBC logging frameworks such as log4jdbc or P6Spy, which can deliver considerably more detailed information about JDBC activity, e.g. timings, result sets, connection statistics etc. Such frameworks are hooked in as a proxy for a DataSource or Connection, which is easy to implement in Java code or declaratively (Spring).

Yet another way to get an overview of Hibernate activity is to use Hibernate statistics.
Statistics collection is enabled with the „generate_statistics" flag; the metrics can then be published (at regular intervals) to the console or via JMX. Here is an example of publishing them via JMX with the help of Spring:

<bean id="mbeanServer" class="">
	<property name="locateExistingServerIfPossible" value="true"/>
</bean>
<bean id="jmxExporter" class="org.springframework.jmx.export.MBeanExporter">
	<property name="server" ref="mbeanServer"/>
	<property name="beans">
		<map>
			<entry key="Hibernate:application=Statistics" value-ref="jmxStatisticsBean"/>
		</map>
	</property>
</bean>
<bean id="jmxStatisticsBean" class="org.hibernate.jmx.StatisticsService">
	<property name="statisticsEnabled" value="true"/>
	<property name="sessionFactory" ref="ganesha.sessionFactory"/>
</bean>

Es gibt mehrere Tools für Visualisierung von Hibernate JMX Statistiken, ich kann z.B. Hibernate-JConsole Tool empfehlen, das eine sehr übersichtliche Darstellung von Metriken liefert. Hibernate-jconsole ist ein JConsole/VisualVM plugin, das beim Ausführen aber unbedingt das Hibernate JAR braucht.
Beispiel der Verwendung: java -jar hibernate-jconsole-1.0.4.jar -J-Dhibernate.classpath=/home/user/hibernate-core-3.6.2.Final.jar

Hibernate Statistiken haben in der jetzigen Version zwei kleine Macken. Erstens – werden die SQL Abfragen die mit Criteria API generiert wurden nicht berücksichtigt (es gibt dazu ein Bug/Workarround. Zweitens – werden keine impliziten Queries in die Statistiken eingenommen (z.B. Lazzy-Load Queries oder Queries die durch FetchMode.SELECT generiert wurden). Um auch die implizite Queries analysieren zu können (und überhaupt sehen zu können), werden SQL logging von Hibernate oder spezielle JDBC logging frameworks benötigt.
A further problem of Hibernate statistics is the level of detail of the cache information. Since Hibernate abstracts away from the caching technology, it can only deliver rudimentary information about it (e.g. hits, misses, puts). To see more data about the caches, provider-specific tools are used, depending on the cache provider in use. For EHCache, for example, there is also a JMX monitoring mechanism (it can be used in parallel with the Hibernate statistics) that delivers additional information such as the number of objects in memory and on disk, memory consumption, and evictions.
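As a sketch of how the EHCache MBeans are registered (Ehcache 2.x API; the CacheManager must be the same instance Hibernate uses):

	import java.lang.management.ManagementFactory;
	import javax.management.MBeanServer;
	import net.sf.ehcache.CacheManager;
	import net.sf.ehcache.management.ManagementService;

	// register the Ehcache MBeans on the platform MBean server
	MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();
	CacheManager cacheManager = CacheManager.getInstance();
	// flags: registerCacheManager, registerCaches, registerCacheConfigurations, registerCacheStatistics
	ManagementService.registerMBeans(cacheManager, mBeanServer, true, true, true, true);

Afterwards the cache statistics show up in JConsole/VisualVM next to the Hibernate statistics.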

A completely different logging approach would be to use Oracle database tools (trace, "Top Statements" in Enterprise Manager, AWR). Unfortunately, such instruments often require administration privileges, which an ordinary Java developer rarely gets.

Oracle Tuning

On the Oracle side, the first thing to look at is the execution plan of the SQL statements. The most frequent and simplest performance problem is missing indexes. Oracle, however, offers many further optimization options, which should be planned and implemented only by an experienced Oracle specialist:

  • Caching
  • Partitioning
  • The SQL Performance Analyzer tool
  • SQL Profiles (static execution plans)
  • Optimizations at the hardware and operating-system level
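Looking at an execution plan, at least, does not require special privileges; a minimal example (the table and predicate are illustrative):

	EXPLAIN PLAN FOR
	  SELECT * FROM orders WHERE customer_id = 42;

	SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

A full table scan on a large table in this output is the classic hint at a missing index.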

Oracle has internal optimization mechanisms (statistics, the optimizer, caches) that dynamically try to improve the throughput of the database. For this reason, the following conditions must be met as closely as possible during performance tests:

  • The hardware and software of the Oracle test server should be identical to the production server.
  • The database content should be identical to the production server.
  • During the test, the same processes must run on the test server as on the production server.

Ideally, one would test directly on the production server, which unfortunately is only very rarely possible.
Alternatively, one creates a complete copy of the production system (e.g. with file-system snapshots or RMAN). With the help of realistic test workloads, or special workload-simulation mechanisms such as Oracle Database Replay, the production workload is then simulated on the test system. Only then can the tests and optimizations begin, in order to get the best results (or to localize the performance problem at all).

Hibernate Tuning

In Hibernate there are basically two optimization options:

  1. Optimizing associations (eager vs. lazy loading, FetchMode, BatchSize, etc.)
  2. Caching (query and second-level caching)

With caching, one tries to avoid the database queries entirely.
Hibernate has two levels of entity caches: the first-level cache (which belongs to the Hibernate Session) and the second-level cache (which is shared by several Hibernate Sessions). The maximum performance boost is achieved with the second-level cache (plus, additionally, a query cache). The first- and second-level caches hold the entities. The query cache contains the results of queries, namely only the IDs of the entities read by a query; the entities themselves are kept in the second-level cache. A query cache without an activated second-level cache is therefore pointless!
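Enabling both caches in Hibernate 3.x could look like this (hibernate.cfg.xml fragment, EHCache assumed as provider):

	<property name="hibernate.cache.use_second_level_cache">true</property>
	<property name="hibernate.cache.use_query_cache">true</property>
	<property name="hibernate.cache.provider_class">net.sf.ehcache.hibernate.EhCacheProvider</property>

Note that even with the query cache switched on, each individual query must additionally be marked with setCacheable(true) to land in it.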
When configuring caches, three things have to be considered:

  1. The cache provider and its features, such as storage type (memory, disk, cluster) and support for concurrency strategies and the query cache
  2. The concurrency strategy for each individual entity type (read-only, nonstrict-read-write, read-write, transactional)
  3. Good cache candidates: not all entities are suitable for caching
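The strategy is declared per entity type, e.g. with Hibernate's @Cache annotation (the entity here is illustrative):

	import javax.persistence.Entity;
	import javax.persistence.Id;
	import org.hibernate.annotations.Cache;
	import org.hibernate.annotations.CacheConcurrencyStrategy;

	@Entity
	@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
	public class Customer {
		@Id
		private Long id;
		// mutable entity: the read-write strategy keeps the cache consistent on updates
	}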

For immutable objects, caching can be configured almost without worry; for mutable objects, one has to look at properties such as the entity's memory consumption, the read/write ratio, concurrency, and the invalidation strategy.
Objects in the first- and second-level caches are invalidated automatically as soon as an entity is changed. For distributed applications with a caching provider that lacks cluster support, one has to take care of object invalidation oneself; for this purpose, Hibernate offers special "evict" functions on the SessionFactory to invalidate individual entities and whole cache regions.
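A short sketch of those calls (Hibernate 3.x SessionFactory API; the entity, collection role, and region names are illustrative):

	// drop all cached Customer entities from the second-level cache
	sessionFactory.evict(Customer.class);
	// drop a single cached instance
	sessionFactory.evict(Customer.class, customerId);
	// drop a cached collection role and a query-cache region
	sessionFactory.evictCollection("Customer.orders");
	sessionFactory.evictQueries("customerQueries");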
When the query cache is activated, an additional UpdateTimestampsRegion is created at the same time, in which the times of the last modification of each database table (known to Hibernate) are stored. Each cached query knows the list of all tables used in that query. As soon as Hibernate notices that one of these tables has been changed in the meantime, the cached query results are invalidated.

Optimizing associations is about finding the most efficient balance between the number and the duration of SQL queries (execution time and the number/size of the fetched rows). In general, lazy loading is recommended (it is the default fetch mode in Hibernate), but it cannot be applied blindly to all queries. Eager loading and other strategies can deliver better results in certain situations.
Typical problems are:

  • The N+1 problem: can be improved with FetchMode.JOIN, FetchMode.SELECT, FetchMode.SUBSELECT, and BatchSize
  • Monster SQL (a Cartesian product caused by eager loading: too many JOINs + a very long result set + long execution time): avoided with FetchMode.SELECT or BatchSize, or with lazy loading
  • Too many queries: suboptimal application or DAO logic
  • Use of native SQL with parameters concatenated into the SQL string: parameter binding should be used instead
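Two of these fixes as sketches (Hibernate annotations/API; the entity and query are illustrative): @BatchSize mitigates N+1 by loading lazy collections in groups instead of one query per owner, and parameter binding replaces values pasted into native SQL:

	@Entity
	public class Customer {
		@Id
		private Long id;

		// N+1 mitigation: initialize up to 25 pending lazy collections with one query
		@OneToMany(mappedBy = "customer")
		@BatchSize(size = 25)
		private Set<Order> orders;
	}

	// parameter binding instead of concatenating values into the SQL string
	List<?> rows = session
		.createSQLQuery("select * from orders where customer_id = :id")
		.setParameter("id", customerId)
		.list();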


Performance tuning tips for Hibernate and Java Persistence
Truly understanding the second level and query caches
Hibernate Improving performance
Hibernate Fetching Strategies examples


Oracle Performance Tuning Steps
Oracle Database Performance Tuning Guide

Posted in Java