Optimal Strategies for Large-Scale Batch ETL Jobs - Databricks



The ad tech industry processes large volumes of pixel and server-to-server data for each online user's clicks, impressions, and conversions. At Neustar, we process 10+ billion events per day, and all of our events are fed through a number of Spark ETL batch jobs. Many of our Spark jobs process over 100 terabytes of data per run, with each job completing in around 3.5 hours. This means we needed to optimize our jobs in specific ways to achieve massive parallelization while keeping memory usage (and cost) as low as possible. Our talk focuses on strategies for dealing with extremely large data. We will cover what we learned and the mistakes we made, including:

– Optimizing memory usage using Ganglia
– Optimizing partition counts for different types of stages, and effective joins
– Counterintuitive strategies for materializing data to maximize efficiency
– Spark default settings that matter specifically for large-scale jobs
– Running Spark on Amazon EMR with more than 3,200 cores
– Reviewing the different types of errors and stack traces that occur in large-scale jobs, and how to read and handle them
– Dealing with a large number of map output statuses when 100k partitions are joined with 100k partitions
– Preventing serialization buffer overflow as well as map output status buffer overflow, both of which can easily happen when data is extremely large
– Using partitioners effectively to combine stages and minimize shuffle
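As a rough illustration of the kinds of settings discussed above, a `spark-submit` invocation for a very large batch job might raise the Kryo serialization buffer cap and the RPC message size limit to head off the buffer overflows mentioned in the list. The values and job names below are illustrative placeholders, not Neustar's production configuration:

```shell
# Sketch of spark-submit flags for a very large batch ETL job.
# All values are illustrative assumptions; tune against your own workload.
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=512m \
  --conf spark.rpc.message.maxSize=512 \
  --conf spark.sql.shuffle.partitions=100000 \
  --class com.example.BigEtlJob \
  big-etl-job.jar
# kryoserializer.buffer.max: raised well above the 64m default so that
#   serializing very large records does not overflow the Kryo buffer.
# rpc.message.maxSize (MiB): raised so that large map-output-status
#   payloads (e.g. ~100k partitions joining ~100k partitions) fit in
#   a single RPC message.
# sql.shuffle.partitions: sized to the data volume rather than the
#   default of 200.
```

Note that these are standard Spark configuration keys; the talk itself goes into which settings matter at this scale and why.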
Session hashtag: #EUdev3

About Emma Tang

Emma is a lead software engineer at Neustar, where she is driving the migration of the data processing pipeline to Spark. She received her Master's from the University of Oxford and her Bachelor's in Operations Research and Financial Engineering from Princeton. Emma is passionate about increasing the representation of women in technology, especially in leadership positions.