As a Spark developer, do you want to develop your Spark workflows quickly? Do you want to test your workflows in a sandboxed environment similar to production? Do you want to write end-to-end tests for your workflows and add assertions on top of them? In just a few years, the number of users writing Spark jobs at LinkedIn has grown from tens to hundreds, and the number of jobs running every day has grown from hundreds to thousands. With the ever-increasing number of users and jobs, it becomes crucial to reduce the development time for these jobs. It is also important to test these jobs thoroughly before they go to production. Currently, users have no way to test their Spark jobs end-to-end. The only option is to divide a Spark job into functions and unit test those functions. We’ve tried to address these issues by creating a testing framework for Spark workflows. The testing framework enables users to run their jobs in an environment similar to production, on data sampled from the original data. It consists of a test deployment system, a data generation pipeline that produces the sampled data, a data management system that helps users manage and search the sampled data, and an assertion engine that validates the test output. In this talk, we will discuss the motivation behind the testing framework before diving deep into its design. We will also discuss how the testing framework is helping Spark users at LinkedIn be more productive.
Session hashtag: #EUde12
Anant Nag is a Senior Software Engineer at LinkedIn. He has worked on multiple projects involved in the Hadoop workflow lifecycle. He is one of the core developers of two popular open source projects: Dr. Elephant and the LinkedIn Gradle Plugin for Apache Hadoop. Currently, Anant is focusing on increasing Hadoop developer productivity at LinkedIn, and he is working on an end-to-end testing framework for Spark workflows. Anant holds a Master's in Computer Science from the Indian Institute of Technology (IIT), Madras.
Shankar has 17+ years of experience building distributed systems and productivity tools. He started out building highly successful distributed test automation for Windows and Bing at Microsoft. He then spent 8 years helping build a middle-tier platform that powered most of the online services forming the backbone of Bing and Microsoft Ads. He currently leads the grid productivity team in Bangalore, empowering Hadoop developers at LinkedIn to be more productive with their time and cluster resources.