Hadoop stream processing

|  | Apache Storm | Spark | Apache Samza | Apache Flink | Apache Apex |
| --- | --- | --- | --- | --- | --- |
| Developer | Hortonworks (Twitter) | Databricks | LinkedIn | dataArtisans | DataTorrent |
| Computation model | Storm: streaming. Trident: micro-batching | Micro-batching | Streaming | Streaming or batching | Streaming with time boundaries |
| API | Storm: programmatic. Trident: declarative | Declarative | Programmatic | Declarative | Declarative |
| Resource management | Nimbus + Storm Supervisor | YARN, Mesos, standalone | YARN | YARN, standalone | YARN |
| Parallelism | OS process per task | Configurable | Multiple tasks in one thread |  | DAG optimizer, configurable |
| Language support | Java, Python | Java, Scala, Python | Java, Scala | Java, Scala | Java, Scala |
| Internal messaging | Netty | Netty + Akka | Buffer to disk | ? | Intra-process or disk buffer |
| Back-pressure | Limiting max messages in topology at any time | + | Relies on Kafka through buffer handover | Automatic | Buffering |
| Interactive processing | Over DRPC | From Spark 2.0 | Missing | Missing | Missing |
| Design | Topology | Spark job (DAG) | Samza jobs | Flink job (DAG) |  |
| Message source/target | Spout | Receivers | Stream consumer | Source | Input/output operator |
| Message processing | Bolt | Transformations + operations | Task | Operator => Task | Compute operator |
| Stream primitives | Tuple | DStream, RDD | Message | DataStream | Tuple |
| Kafka integration | Kafka Spout | Receiver approach + DirectStream | Tight | Kafka consumer | Kafka input |
| At most once | Storm + auto ACK. In-order processing. | Receiver without checkpointing. In-order processing. | Not needed | Not used | Stateless operators. In-order processing. |
| At least once | Storm + ACK on success. Out of order. | Receiver with checkpointing. In-order processing. | In-order processing (within partition) | Out-of-order processing | Out-of-order processing |
| Exactly once | Only in Trident mode. In-order processing. | With WAL enabled. In-order processing. | Missing | In-order processing | In-order processing |
| Kafka offset | ZooKeeper | WAL log | ZooKeeper | ZooKeeper | ZooKeeper or OffsetManager |
| Manipulating checkpoints | Possible | Not possible | Utility | Possible | Possible |
| State management | Only in Trident mode | WAL logs. Disabled by default. | Replicated changelog | Distributed snapshotting. Disabled by default. | Operator snapshot. Enabled by default. Suppressible. |
| State storage | Pluggable, API. JDBC, FileSystem, etc. | FileSystem (HDFS, S3) | Pluggable, API. LevelDB key/value | FileSystem (HDFS, in-memory) | HDFS |
| State behavior | Cached state batches | At end of the micro-batch | Task-local | Time-periodical Chandy-Lamport | At window boundary |
| Join streams | Code or Trident API | + | Code | + | + |
| Merge streams | + | + | + | + | + |
| Partitioning | By grouping | + | Code | + | + |
| Time-based windows | Storm: TickTuple. Trident: batch | + | Yes | + | + |
| Count-based windows | Code or Trident API | Hard | Code | + | Possible |
| Sliding windows | Code or Trident API | Time-based | Code | Time- and count-based | Time-based |
| Latency | Storm: microseconds. Trident: sub-second | Sub-second | Milliseconds (disk buffering) | Milliseconds | Milliseconds |
| Throughput | Medium | High | Medium to high | High | High |
| Impact of fault | Medium to high | High | Low | High | Low |
| Scalability | Horizontal (ACK limit), rebalancing | Scheduler limit | Horizontal | Horizontal | Horizontal, dynamic |

 

Storm:

  • There is a storm-yarn wrapper implementation from Yahoo that allows a Storm cluster to be started inside YARN containers, but this is not the default execution scenario.
  • A bolt can maintain in-memory state (which is lost if that bolt dies), or it can make calls to a remote database to read and write state. However, a topology can usually process messages at a much higher rate than remote storage calls can be made, so a remote call per message quickly becomes a bottleneck (see the sketch after this list).
  • An upstream bolt/spout failure triggers a re-computation of the entire tuple tree, and failed tuples cannot be replayed quickly, so latency and throughput requirements are met only when no failure occurs.
  • Trident provides a high-level API for data manipulation such as filtering, joins and aggregations, but it implements this using standard Storm building blocks; the same logic could be implemented manually.
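
A minimal sketch of the in-memory state option, assuming the Storm 1.x Java API (org.apache.storm packages); the bolt and field names are purely illustrative:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Keeps a running word count purely on the worker's heap. The state is fast to
// access, but it is lost whenever the bolt (or its worker process) dies.
public class InMemoryWordCountBolt extends BaseBasicBolt {

    private final Map<String, Long> counts = new HashMap<>();

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String word = input.getStringByField("word");
        long count = counts.merge(word, 1L, Long::sum);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
```

The alternative, a remote read and write per tuple against an external database, is easy to code but, as noted above, quickly becomes the bottleneck of the topology.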

Spark:

  • The same code can be used for batch and stream processing.
  • Many tools and integration possibilities with other systems: GraphX, MLlib, SQL, Atlas Notebooks.
  • Spark stores its internal state in a proprietary format (apparently a serialized Java object graph), which makes upgrading a Spark application more difficult (a graceful shutdown with an empty state is needed).
  • It is possible to set "maxRate" or "maxRatePerPartition" for Kafka streams, limiting the maximum number of messages processed per micro-batch (see the configuration sketch after this list).
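
A minimal configuration sketch, assuming Spark Streaming with the Kafka integration; the two keys are the settings referred to above, the numeric values and the class name are placeholders:

```java
import org.apache.spark.SparkConf;

public class RateLimitedStreamingConf {

    public static SparkConf build() {
        return new SparkConf()
                .setAppName("rate-limited-kafka-stream")
                // Receiver-based streams: max records per second per receiver.
                .set("spark.streaming.receiver.maxRate", "10000")
                // Direct (receiver-less) Kafka streams: max records per second per Kafka partition.
                .set("spark.streaming.kafka.maxRatePerPartition", "1000");
    }
}
```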

Samza:

  • By co-locating storage and processing on the same machine, Samza is able to achieve very high throughput even when there is a large amount of state (see the sketch after this list).
  • Every input and output stream in Samza is fault-tolerant and replicated (messages are buffered between StreamTasks), which makes "at most once" processing obsolete: messages will always be processed.
  • Single-threaded parallelism (a container executes only one thread but may contain multiple tasks) maps one thread to exactly one CPU, which provides a more predictable resource model and reduces interference from other tasks running on the same machine.
  • Different Samza jobs can communicate with each other over named streams, which makes Samza work more like an EAI engine.
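
A minimal sketch of co-located state, assuming Samza's classic low-level StreamTask API (pre-1.0 package layout); the store name "counts" is illustrative and has to be declared, together with its changelog, in the job configuration:

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

// Counts messages per key in a local key/value store. Reads and writes stay on the
// task's machine (e.g. LevelDB, as in the table above), while the replicated
// changelog stream makes the state recoverable after a failure.
public class LocalStateCountTask implements StreamTask, InitableTask {

    private KeyValueStore<String, Integer> counts;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        counts = (KeyValueStore<String, Integer>) context.getStore("counts");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                        TaskCoordinator coordinator) {
        String key = (String) envelope.getKey();
        Integer current = counts.get(key);
        counts.put(key, current == null ? 1 : current + 1);
    }
}
```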

Flink:

  • Broader toolset than Storm or Apex: ML, batch processing, SQL-like queries.
  • A failure resets all operators back to the source; throughput and latency requirements can be met only if no errors occur (see the checkpointing sketch after this list).
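
A minimal sketch, assuming the Flink DataStream Java API; enableCheckpointing switches on the distributed snapshotting from the table (it is disabled by default), while the socket source and the map are just placeholders:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a distributed snapshot (Chandy-Lamport style barriers) every 5 seconds;
        // after a failure all operators are reset to the latest completed snapshot.
        env.enableCheckpointing(5000);

        env.socketTextStream("localhost", 9999)
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String value) {
                   return value.toUpperCase();
               }
           })
           .print();

        env.execute("checkpointed-job");
    }
}
```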

Apex:

  • Outstanding latency and throughput even in failure scenarios.
  • Apex is Spark on steroids: it uses all the advantages of micro-batching (executing sequences of batches/windows in a semi-parallel fashion) while avoiding Spark's drawbacks (no scheduling, and no shuffling or serialization/deserialization in THREAD_LOCAL, CONTAINER_LOCAL or NODE_LOCAL topologies); see the locality sketch after this list.
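
A minimal sketch, assuming the Apex 3.x Java API plus the Malhar ConsoleOutputOperator (package names differ slightly between releases); CounterInputOperator is a toy operator defined only for this example:

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DAG.Locality;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.common.util.BaseOperator;
import com.datatorrent.lib.io.ConsoleOutputOperator;

public class LocalityDemoApplication implements StreamingApplication {

    /** Toy input operator that emits an increasing counter as fast as the engine asks. */
    public static class CounterInputOperator extends BaseOperator implements InputOperator {
        public final transient DefaultOutputPort<Long> out = new DefaultOutputPort<>();
        private long counter;

        @Override
        public void emitTuples() {
            out.emit(counter++);
        }
    }

    @Override
    public void populateDAG(DAG dag, Configuration conf) {
        CounterInputOperator input = dag.addOperator("input", new CounterInputOperator());
        ConsoleOutputOperator console = dag.addOperator("console", new ConsoleOutputOperator());

        // THREAD_LOCAL keeps both operators in the same thread, so tuples are handed over
        // in-process without serialization or a network hop; CONTAINER_LOCAL and NODE_LOCAL
        // are the progressively weaker locality levels mentioned above.
        dag.addStream("numbers", input.out, console.input)
           .setLocality(Locality.THREAD_LOCAL);
    }
}
```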

General notes:

  • Micro-batching limits the latency of jobs to that of the micro-batch. While sub-second batch latency is feasible for simple applications, applications with multiple network shuffles easily bring the latency up to multiple seconds.
  • Micro-batch architectures that use time-based batches have an inherent problem with back-pressure effects. If processing of a micro-batch takes longer in downstream operations (e.g., due to a compute-intensive operator or a slow sink) than in the batching operator (typically the source), the micro-batch will take longer than configured. This leads either to more and more batches queueing up, or to a growing micro-batch size.
  • Well-designed streaming systems (as opposed to micro-batching ones) do not ship each record over the network individually; they buffer many records before sending them (using messaging frameworks such as Netty) while still running continuous operators.

Links:
Apache streaming comparison
Benchmarking Streaming Computation Engines at Yahoo!
The world beyond batch: Streaming 101
The world beyond batch: Streaming 102
Apache Beam
