Jianneng Li is a Software Development Engineer at Workday (and previously with Platfora, acquired by Workday in late 2016). He works on Prism Analytics, an end-to-end data analytics product, part of the Workday ecosystem, which helps businesses better understand their financials and HR data. Being a part of Spark team in Analytics org, Jianneng specializes in distributed systems and data processing. He enjoys diving into Spark internals and published several blog posts about Apache Spark and analytics. Jianneng holds a Master’s degree in EECS from UC Berkeley and a Bachelor’s degree in CS from Cornell University.
Broadcast join is an important part of Spark SQL's execution engine. When used, it performs a join on two relations by first broadcasting the smaller one to all Spark executors, then evaluating the join criteria with each executor's partitions of the other relation. When the broadcasted relation is small enough, broadcast joins are fast, as they require minimal data shuffling. Above a certain threshold however, broadcast joins tend to be less reliable or performant than shuffle-based join algorithms, due to bottlenecks in network and memory usage. This talk shares the improvements Workday has made to increase the threshold of relation size under which broadcast joins in Spark are practical. We cover topics such as executor-side broadcast, where data is broadcasted directly between executors instead of being first collected to the driver, as well as the changes we made in Spark's whole-stage code generator to accommodate the increased threshold for broadcasting. Specifically, we highlight how we limited the memory usage of broadcast joins in executors, so that we can increase the threshold without increasing executor memory. Furthermore, we present our findings from running these improvements in production, on large scale jobs compiled from complex ETL pipelines. From this session, the audience will be able to peek into internals of Spark's join infrastructure, and take advantage of our learnings to better tune their own workloads.
In this talk, we will share how we benefited from using Apache Spark to build Workday's new analytics product, as well as some of the challenges we faced along the way. Workday Prism Analytics was launched in September 2017, and went from zero to one hundred enterprise customers in under 15 months. Leveraging innovative technologies from Platfora acquisition gave us a jump-start, but it still required a considerable engineering effort to integrate with Workday ecosystem. We enhanced workflows, added new functionalities and transformed Hadoop-based on-premises engines to run on Workday cloud. All of this would not have been possible without Spark, to which we migrated most of earlier MapReduce code. This enabled us to shorten time to market while adding advanced functionality with high performance and rock-solid reliability. One of the key components of our product is Self-Service Data Prep. Powerful and intuitive UI empower users to create ETL-like pipelines, blending Workday and external data, while providing immediate feedback by re-executing the pipelines on sampled data. Behind the scenes, we compile these pipelines into plans to be executed by Spark SQL, taking advantage of the years of work done by the open source community to improve the engine's query optimizer and physical execution. We will outline the high-level implementation of product features, mapping logical models and sub-systems, adding new data types on top of Spark, and using caches effectively and securely, in multiple Spark clusters running under YARN, while sharing HDFS resources. We will also describe several real-life war stories, caused by customers stretching the product boundaries in complexity and performance. We conclude with the unique Spark tuning guidelines distilled from our experience of running it in production, in order to ensure that the system is able to execute complex, nested pipelines with multiple self-joins and self-unions.