Migrating Complex Data Aggregation from Hadoop to Spark

Download Slides

This talk discusses our experience of moving from Hadoop MR to Spark. Our initial implementation used a multiple stage aggregation framework within Hadoop MR to join, de-dupe, and group 12TB of incoming data every 3 hours. There was an additional requirement to join other heterogeneous data sources along with implementation of algorithms like HyperLogLog. The Hadoop MR Cluster size and cost exponentially increased with scale for these use-cases so we evaluated Spark. A Spark 1.1 cluster ran the aggregation pipeline with 60% fewer nodes and 30% cost savings as compared to Hadoop MR within the SLA limits. The HyperLogLog implementation ran 5-8X faster than Hadoop MR on the same number of nodes. Optimizations and tuning were necessary on serialization (Kyro), parallelism, partitioning, compression, batch size, and memory assignments. The Spark extension provided by Cascading was used to migrate the code to Spark.