Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World

In this talk, we will share how we benefited from using Apache Spark to build Workday’s new analytics product, as well as some of the challenges we faced along the way. Workday Prism Analytics was launched in September 2017 and went from zero to one hundred enterprise customers in under 15 months. Leveraging technologies from the Platfora acquisition gave us a jump-start, but it still required a considerable engineering effort to integrate with the Workday ecosystem. We enhanced workflows, added new functionality, and transformed Hadoop-based on-premises engines to run in the Workday cloud. All of this would not have been possible without Spark, to which we migrated most of the earlier MapReduce code. This enabled us to shorten time to market while adding advanced functionality with high performance and rock-solid reliability.

One of the key components of our product is Self-Service Data Prep. Its powerful and intuitive UI empowers users to create ETL-like pipelines, blending Workday and external data, while providing immediate feedback by re-executing the pipelines on sampled data. Behind the scenes, we compile these pipelines into plans to be executed by Spark SQL, taking advantage of the years of work done by the open source community to improve the engine’s query optimizer and physical execution.
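To illustrate the general idea, the sketch below (a minimal example with hypothetical dataset names, columns, and paths, not Workday's actual code) shows how a data-prep pipeline of filter, derive, and join steps can be expressed as a single Spark SQL / DataFrame expression tree that Catalyst optimizes, and how re-running the same plan on a sample supports an interactive preview:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object PrepPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("prep-pipeline-sketch")
      .getOrCreate()

    // Assumed inputs: a Workday-managed dataset and an external CSV upload.
    // Paths and column names are illustrative only.
    val workers  = spark.read.parquet("/data/workday/workers")
    val external = spark.read.option("header", "true").csv("/uploads/compensation.csv")

    // A "prep pipeline" of filter -> derive -> join steps compiles into one
    // DataFrame expression tree (a Catalyst logical plan) before execution.
    val prepared = workers
      .filter(col("status") === "ACTIVE")
      .withColumn("full_name", concat_ws(" ", col("first_name"), col("last_name")))
      .join(external, Seq("worker_id"), "left")   // assumes both sides carry worker_id

    // For immediate feedback, the same plan can be re-executed on sampled data.
    val preview = prepared.sample(withReplacement = false, fraction = 0.01).limit(100)
    preview.show()

    spark.stop()
  }
}
```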

We will outline the high-level implementation of product features: mapping logical models and sub-systems, adding new data types on top of Spark, and using caches effectively and securely across multiple Spark clusters running under YARN while sharing HDFS resources. We will also describe several real-life war stories caused by customers stretching the product's boundaries in complexity and performance. We conclude with Spark tuning guidelines distilled from our experience running it in production, aimed at ensuring the system can execute complex, nested pipelines with multiple self-joins and self-unions.
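As a rough illustration of why self-joins and self-unions matter for tuning (the dataset, columns, and paths below are assumptions for the sketch, not the product's actual schema), persisting a shared intermediate result once before it feeds several branches of the plan keeps such pipelines from recomputing the same work for every branch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel

object SelfJoinUnionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("self-join-union-sketch").getOrCreate()

    // Illustrative input; account_id is assumed to be a string, period an integer.
    val ledger = spark.read.parquet("/data/ledger")

    // Shared intermediate referenced by multiple branches of the pipeline;
    // persist it once so each branch reuses it instead of recomputing it.
    val byAccount = ledger
      .groupBy("account_id", "period")
      .agg(sum("amount").as("total"))
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Self-join: compare each period against the previous period.
    val previous = byAccount
      .withColumn("period", col("period") + 1)
      .withColumnRenamed("total", "prev_total")
    val deltas = byAccount.join(previous, Seq("account_id", "period"), "left")

    // Self-union: append a rollup row set derived from the same intermediate.
    val rollup = byAccount
      .groupBy("period")
      .agg(sum("total").as("total"))
      .withColumn("account_id", lit("ALL"))
      .select("account_id", "period", "total")   // match column order for union
    val combined = byAccount.union(rollup)

    deltas.show()
    combined.show()

    byAccount.unpersist()
    spark.stop()
  }
}
```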

 

About Pavel Hardak

Pavel is a Director of Product Management at Workday. He works on the Prism Analytics product, focusing on backend technologies powered by Hadoop and Apache Spark. Pavel is particularly excited about Big Data, cloud, and open source, not necessarily in that order. Before Workday, Pavel was with Basho, the company behind Riak, an open-source NoSQL database with Mesos, Spark, and Kafka integrations. Earlier, Pavel was with Boundary, which developed a real-time SaaS monitoring solution and was acquired by BMC Corp. Before that, Pavel worked in Product Management and Engineering roles focusing on Big Data, cloud, networking, and analytics, and authored several patents.

About Jianneng Li

Jianneng Li is a Software Development Engineer at Workday (previously with Platfora, acquired by Workday in late 2016). He works on Prism Analytics, an end-to-end data analytics product within the Workday ecosystem that helps businesses better understand their financial and HR data. As part of the Spark team in the Analytics organization, Jianneng specializes in distributed systems and data processing. He enjoys diving into Spark internals and has published several blog posts about Apache Spark and analytics. Jianneng holds a Master’s degree in EECS from UC Berkeley and a Bachelor’s degree in CS from Cornell University.