Hadoop file format comparison

Use case and environment

IoT data lake use case.
6000 devices (each with a unique ID), measuring 3 values 60 times per second (60 Hz).
One day of data (24 hours) – 6000 devices × 60 records/s × 86.400 s = 31.104.000.000 records in the database.
Row in a table – [ID:int, timestamp:long, value1:float, value2:float, value3:long]
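The post does not show the table DDL; assuming Hive, the row layout above corresponds to roughly the following sketch (the table name `sensor_data` and column names are made up for illustration):

```sql
-- Hypothetical base table matching the row layout above
CREATE TABLE sensor_data (
  id     INT,     -- device ID, 6000 distinct values
  ts     BIGINT,  -- measurement timestamp
  value1 FLOAT,
  value2 FLOAT,
  value3 BIGINT
)
STORED AS ORC;    -- file format varies per scenario below
```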

Hortonworks 2.4 Hadoop cluster – 16 DataNodes, each with 24 VCores and 256 GB of memory
Hive 0.13, Spark 1.6, Tez

Goal

Find the optimal table layout and file format to optimize:
1) Compression
2) SQL latency for data selection (a few IDs in one select).

Layout scenarios

Assumption – partitioning the table by ID (one subfolder for every ID value) or bucketing (rows with the same ID saved in one file; number of files = number of buckets) allows faster selection of data for a particular ID.
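The two layouts from this assumption can be sketched in HiveQL (table names and the bucket count are illustrative; the actual DDL is not shown in the post):

```sql
-- Variant A: one partition (HDFS subfolder) per device ID
CREATE TABLE sensor_data_part (
  ts BIGINT, value1 FLOAT, value2 FLOAT, value3 BIGINT
)
PARTITIONED BY (id INT)
STORED AS ORC;

-- Variant B: rows with the same ID hashed into a fixed number of files
CREATE TABLE sensor_data_buck (
  id INT, ts BIGINT, value1 FLOAT, value2 FLOAT, value3 BIGINT
)
CLUSTERED BY (id) SORTED BY (id) INTO 16 BUCKETS
STORED AS ORC;
```

For variant B, inserts must run with `SET hive.enforce.bucketing=true;` so that Hive actually produces one file per bucket.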

| Nr | File type    | Partitioning/bucketing                      | Compr.  | Options                                     | Table size (GB) | Files |
|----|--------------|---------------------------------------------|---------|---------------------------------------------|-----------------|-------|
| 1  | ORC          | –                                           | ZLIB    | 256MB stripe, 10.000 stride                 | 176,7           | 210   |
| 2  | ORC          | –                                           | ZLIB    | 256MB stripe, 100.000 stride                | 176,5           | 209   |
| 3  | ORC          | –                                           | SNAPPY  | 256MB stripe, 100.000 stride                | 254,2           | 285   |
| 4  | PARQUET      | 16 buckets, sorted                          | SNAPPY  | –                                           | 346,5           | 16    |
| 5  | PARQUET      | 32 buckets, sorted                          | SNAPPY  | –                                           | 346,5           | 32    |
| 6  | PARQUET      | 64 buckets, sorted                          | SNAPPY  | –                                           | 346,5           | 64    |
| 7  | ORC          | 16 buckets, sorted                          | ZLIB    | 64MB stripe, 10.000 stride, bloom filter    | 171,6           | 16    |
| 8  | ORC          | 32 buckets, sorted                          | ZLIB    | 64MB stripe, 10.000 stride, bloom filter    | 171,6           | 32    |
| 9  | ORC          | 64 buckets, sorted                          | ZLIB    | 64MB stripe, 10.000 stride, bloom filter    | 171,6           | 64    |
| 10 | ORC          | 16 buckets, sorted                          | ZLIB    | 256MB stripe, 100.000 stride, bloom filter  | 171,5           | 16    |
| 11 | SequenceFile | –                                           | SNAPPY  | 256MB block, BLOCK compression              | 529,6           | 601   |
| 12 | SequenceFile | –                                           | BZIP2   | 256MB block, BLOCK compression              | 239,5           | 268   |
| 13 | SequenceFile | –                                           | Deflate | 256MB block, BLOCK compression              | 300,2           | 332   |
| 14 | Avro         | –                                           | SNAPPY  | 256MB block, BLOCK compression              | 481,8           | 545   |
| 15 | Avro         | –                                           | BZIP2   | 256MB block, BLOCK compression              | 343,9           | 387   |
| 16 | Avro         | –                                           | Deflate | 256MB block, BLOCK compression              | 358,9           | 403   |
| 17 | ORC          | partition by ID                             | ZLIB    | 256MB stripe, 100.000 stride, bloom filter  | 208,9           | 6000  |
| 18 | PARQUET      | partition by ID                             | SNAPPY  | –                                           | 346,6           | 6000  |
| 19 | ORC          | partition by mod(ID, 100), sorted           | ZLIB    | 256MB stripe, 100.000 stride, bloom filter  | 176,7           | 100   |
| 20 | ORC          | partition by mod(ID, 100) + 8 buckets, sorted | ZLIB  | 256MB stripe, 100.000 stride, bloom filter  | 171,5           | 800   |
| 21 | PARQUET      | partition by mod(ID, 100) + 8 buckets, sorted | SNAPPY | –                                          | 352             | 800   |
  • SequenceFile could be a good candidate for the partition-by-ID scenario, but BZIP2 is very expensive (4 times longer to generate the test table compared to ORC + ZLIB).
  • ORC + ZLIB files partitioned by ID allow fast selection and have good compression.
  • ORC files partitioned by ID plus bucketing could be a good compromise between number of files and selection time.
  • PARQUET + SNAPPY has worse compression than ORC + ZLIB. PARQUET currently supports neither ZLIB nor BZIP2 compression.
  • Avro compression is worse than that of ORC or PARQUET.
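As an illustration of how the ORC options from the table map to DDL, scenario Nr 10 could be declared roughly as follows (table name hypothetical; `orc.stripe.size` is given in bytes):

```sql
-- Hypothetical DDL for scenario Nr 10: ORC + ZLIB, 256MB stripe,
-- 100.000-row index stride, bloom filter on the ID column
CREATE TABLE sensor_data_orc10 (
  id INT, ts BIGINT, value1 FLOAT, value2 FLOAT, value3 BIGINT
)
CLUSTERED BY (id) SORTED BY (id) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'             = 'ZLIB',
  'orc.stripe.size'          = '268435456',  -- 256 MB
  'orc.row.index.stride'     = '100000',
  'orc.bloom.filter.columns' = 'id'
);
```

The row index stride and the bloom filter are what make the "records scanned" numbers below drop: with a sorted table, ORC can skip whole row groups whose min/max index does not match the ID predicate.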

Data analysis

Test – calculate the daily average for 3 IDs: select avg(value1) from table where id in (id-1, id-2, id-3)

The select was executed 3 times each on the Tez and Spark engines (MapReduce was too slow; Impala is not available on Hortonworks). Average time, number of workers, and scanned records were constant across executions. Slow scenarios (e.g. SequenceFile and Avro) were not executed on Spark (Tez only).
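A sketch of the benchmark query on the Hive side (the device IDs are placeholders, as in the query above; the Spark runs were presumably executed separately via Spark SQL rather than through Hive):

```sql
SET hive.execution.engine=tez;   -- Tez runs; MapReduce was too slow

SELECT avg(value1)
FROM sensor_data                 -- hypothetical table name
WHERE id IN (id1, id2, id3);     -- three placeholder device IDs
```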

Spark ran with 64 workers.

| Nr | Tez time (sec) | Tez tasks | Tez records scanned | Spark time (sec) | Spark tasks | Spark records scanned |
|----|----------------|-----------|---------------------|------------------|-------------|-----------------------|
| 1  | 91    | 210 | 36.134.889      | 127 | 668  | 31.104.000.000 |
| 2  | 92    | 209 | 219.484.889     | 140 | 668  | 31.104.000.000 |
| 3  | 81    | 285 | 35.979.889      | 161 | 1003 | 31.104.000.000 |
| 4  | 84    | 398 | 539.167.342     | 82  | 2784 | 31.104.000.000 |
| 5  | 88    | 397 | 539.167.342     | 82  | 2784 | 31.104.000.000 |
| 6  | 93    | 392 | 2.134.662.574   | 79  | 2800 | 31.104.000.000 |
| 7  | 31    | 48  | 15.580.000      | 106 | 688  | 31.104.000.000 |
| 8  | 30    | 96  | 15.590.000      | 106 | 704  | 31.104.000.000 |
| 9  | 32    | 187 | 15.585.000      | 103 | 704  | 31.104.000.000 |
| 10 | 19    | 48  | 15.580.000      | 184 | 689  | 31.104.000.000 |
| 11 | 471   | 601 | 31.104.000.000  | 402 | 4334 | 31.104.000.000 |
| 12 | 1242  | 268 | 31.104.000.000  | n/a | n/a  | n/a            |
| 13 | 437   | 332 | 31.104.000.000  | n/a | n/a  | n/a            |
| 14 | 1512  | 545 | 31.104.000.000  | n/a | n/a  | n/a            |
| 15 | 3691  | 387 | 31.104.000.000  | n/a | n/a  | n/a            |
| 16 | 2671  | 403 | 31.104.000.000  | n/a | n/a  | n/a            |
| 17 | 17    | 3   | 15.552.000      | 6   | 4    | 15.552.000     |
| 18 | error | 12  | –               | 5   | 4    | 15.552.000     |
| 19 | 56    | 117 | 160.437.634     | 7   | 585  | 933.120.000    |
| 20 | 25    | 24  | 17.800.000      | 8   | 24   | 933.120.000    |
| 21 | error | 113 | –               | 6   | 97   | 933.120.000    |
  • Only partitioned and bucketed tables provide acceptable latency.
  • Spark does not use ORC "predicate pushdown" (SPARK-11087), but is faster than Tez (at least for small files).
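For reference: in Spark 1.x, ORC filter pushdown sits behind a configuration flag that is off by default, and per SPARK-11087 the pushdown did not take effect for these queries; the flag itself can at least be toggled in Spark SQL:

```sql
-- Off by default in Spark 1.6; see SPARK-11087 for why it did not
-- enable ORC row-group skipping in these tests
SET spark.sql.orc.filterPushdown=true;
```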

Data selection

Test – select daily data for 3 IDs: select * from table where id in (id-1, id-2, id-3)

Only the "interesting" scenarios were tested. Each select was executed 3 times; average times are shown in the table.

Spark ran with 64 workers.

| Nr | Tez time (sec) | Tez tasks | Tez records scanned | Spark time (sec) | Spark tasks | Spark records scanned |
|----|----------------|-----------|---------------------|------------------|-------------|-----------------------|
| 17 | 27 | 3  | 15.552.000 | 29 | 4  | 15.552.000  |
| 20 | 36 | 24 | 17.800.000 | 32 | 24 | 933.120.000 |

Conclusions

  • The ORC file format with the ZLIB compression algorithm provides the best compression level.
  • The SequenceFile and Avro file formats both showed poor compression levels and selection times.
  • SNAPPY compression allows faster selection.
  • SNAPPY is better supported by Spark than by Tez.
  • Spark does not support Hive bucketing and does not use ORC predicate pushdown.
  • Spark is faster for analytical SQL queries (e.g. AVG).
  • Spark and Tez are comparable for pure data selection (scenario Nr 17); Spark suffers from high costs for job/data serialization and shuffling.
  • For "modulo partitioning" (scenario Nr 20 – not supported natively by Hive), Spark provides better throughput thanks to better optimization and parallelization over a big number of files.
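The "modulo partitioning" of scenario Nr 20 can be sketched as a derived partition column (all names hypothetical); note that the query has to repeat the modulo predicate explicitly, since Hive cannot derive `id_mod` from `id` on its own:

```sql
-- 100 partitions via a derived key, 8 buckets per partition
CREATE TABLE sensor_data_mod (
  id INT, ts BIGINT, value1 FLOAT, value2 FLOAT, value3 BIGINT
)
PARTITIONED BY (id_mod INT)
CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS
STORED AS ORC;

-- Partition pruning only happens if the id_mod predicate is spelled out
SELECT avg(value1)
FROM sensor_data_mod
WHERE id IN (id1, id2, id3)   -- placeholder device IDs
  AND id_mod IN (pmod(id1, 100), pmod(id2, 100), pmod(id3, 100));
```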