Hive vs Spark vs Impala

| | Hive 0.13 | Spark 1.6 | Impala 2.1 |
|---|---|---|---|
| Support | Hortonworks + Yahoo | Databricks + Yahoo | Cloudera |
| Cluster management | YARN | YARN, Mesos, local | YARN (Llama) |
| Engine | MR, Tez | Spark | impalad |
| Where tables are stored | HDFS | HDFS (through Hive Metastore). Distributed shared object space with overflow to disk. Special storage implementations (e.g. CassandraSQLContext, HBaseSQLContext). | HDFS (through Hive Metastore). |
| Preferred storage format | ORC | Parquet | Parquet |
| Joins | Mostly on disk (depending on engine) | Memory | Memory |
| Target | Long-running fault-tolerant queries. | Ad hoc queries. Long-running fault-tolerant queries. | Ad hoc queries. No fault tolerance between queries. |
| Performance | Depends on query type and engine. Tez: good. MR: mediocre. | Good: better than Tez for complex queries, thanks to better optimizations and in-memory processing. | Good: slower than Spark due to the lack of a cache and an in-memory data store. |
| Latency | Bad: seconds pass before query execution starts. Better on pre-warmed Tez containers. | Average: overhead for job planning/distribution and RDD handling (serialization, shuffling etc.). | Good: the impalad daemon executes ad hoc requests. |
| User-defined functions | Supported | Supported | Work in progress |
| Integration of heterogeneous databases | Supported through pluggable "Storage Handlers" (e.g. any JDBC data source). | Different SQL context objects in the same SparkContext (e.g. CassandraSQLContext and HiveContext). | Through Hive + storage handlers. |
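
The "Joins" row is easier to picture with a toy example. The sketch below is only an illustration in plain Python, not how Hive, Spark or Impala actually implement joins: `hash_join` keeps one side entirely in memory (fast, but it only works while that side fits in RAM), while `merge_join` first writes both sides to disk as sorted files and then streams them (slower, but memory use stays small). All function names and the sample data are hypothetical.

```python
import csv
import os
import tempfile
from itertools import groupby


def hash_join(left, right):
    """In-memory join: build a hash table on `left`, probe it with `right`."""
    table = {}
    for key, value in left:
        table.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, rv in right for lv in table.get(key, [])]


def spill_sorted(rows, directory, name):
    """Sort rows by key and write them to a CSV file (stands in for a spill)."""
    path = os.path.join(directory, name)
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(sorted(rows))
    return path


def read_groups(path):
    """Stream a sorted file as (key, [values]) groups, one key at a time."""
    with open(path, newline="") as handle:
        for key, group in groupby(csv.reader(handle), key=lambda row: row[0]):
            yield key, [row[1] for row in group]


def merge_join(left_path, right_path):
    """Disk-based join: walk two key-sorted files in lockstep."""
    out = []
    left, right = read_groups(left_path), read_groups(right_path)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left, None)
        elif l[0] > r[0]:
            r = next(right, None)
        else:
            out.extend((l[0], lv, rv) for lv in l[1] for rv in r[1])
            l, r = next(left, None), next(right, None)
    left.close()   # release the file handles held by the generators
    right.close()
    return out


def demo():
    users = [("u1", "Ann"), ("u2", "Bob"), ("u4", "Dan")]
    orders = [("u2", "book"), ("u1", "pen"), ("u3", "lamp"), ("u1", "ink")]
    in_memory = hash_join(users, orders)
    with tempfile.TemporaryDirectory() as tmp:
        users_path = spill_sorted(users, tmp, "users.csv")
        orders_path = spill_sorted(orders, tmp, "orders.csv")
        on_disk = merge_join(users_path, orders_path)
    return in_memory, on_disk


if __name__ == "__main__":
    in_memory, on_disk = demo()
    print(sorted(in_memory) == sorted(on_disk))
```

Both variants return the same matches; they differ in where the intermediate data lives, which is the trade-off the table summarizes as "Memory" versus "Mostly on disk".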

