Apache Spark claims to outperform Hadoop by a huge margin for iterative machine learning workloads. But how does Spark on YARN perform for ETL jobs in a multi-tenant environment with hundreds of users, thousands of nodes, and a petabyte-scale data warehouse? This talk covers the technical aspects of our journey at Netflix to productionize Spark for ETL. The Big Data Platform team at Netflix maintains the compute resources, infrastructure, and a cloud-based data warehouse with over 25 petabytes of data stored on Amazon S3, predominantly in Parquet format. In this presentation, we explore our deployment and the challenges of running Spark alongside traditional YARN workloads. We cover how Spark fits into our big data ecosystem, multi-tenancy, and our experience porting ETL jobs from MapReduce to Spark. In addition, we dive into optimizations such as S3 bulk listing and output committer, Parquet optimizations through dictionary pushdown, and continuous processing patterns. Then we'll take a look at our on-demand Zeppelin and PySpark notebooks, which run within Docker containers in the cloud. Finally, we will conclude by presenting use cases, performance benchmarks, our contributions, and our roadmap.
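To give a flavor of the Parquet pushdown optimization mentioned above: Spark SQL exposes a standard configuration flag that pushes query filters down into the Parquet reader so that row groups can be skipped using column statistics (the deeper dictionary-level filtering discussed in the talk builds on Parquet-side improvements beyond this flag). A minimal sketch of enabling it at submit time; the script name is a placeholder, not from the talk:

```shell
# Enable predicate pushdown into the Parquet reader (on by default in
# recent Spark versions, shown here explicitly for illustration).
spark-submit \
  --conf spark.sql.parquet.filterPushdown=true \
  etl_job.py
```

With pushdown enabled, a query like `SELECT count(*) FROM events WHERE country = 'US'` can skip entire row groups whose statistics rule out matching rows, which matters most when reading large Parquet tables from S3.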
Ashwin Shankar is an Apache Hadoop and Spark contributor. He is a Senior Software Engineer at Netflix and is passionate about developing features and debugging problems in large-scale distributed systems. Ashwin holds a Master's degree in Computer Science from the University of Illinois at Urbana-Champaign.