Automated Debugging of Big Data Analytics in Apache Spark Using BigSift - Databricks

Developing big data analytics often involves trial-and-error debugging, because datasets are unclean or assumptions about the data are wrong. Data scientists typically write code that implements a data processing pipeline and test it on their local workstation with a small data sample downloaded from a TB-scale data warehouse. They cross their fingers and hope that the program works in the expensive production cloud. When a job fails or produces a suspicious result, they spend hours guessing at the source of the error and digging through post-mortem logs.

In such cases, data scientists may want to pinpoint the root cause of an error by investigating the subset of input records that corresponds to it. In this talk, we present BigSift, an automated debugger for Apache Spark that data engineers and scientists can use. It takes an Apache Spark program, a user-defined test oracle function, and a dataset as input, and outputs a minimum set of input records that reproduces the same test failure.
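To make the idea of a test oracle and a minimal failure-inducing input concrete, here is a minimal sketch of classic delta debugging from the software engineering literature, written in plain Python. It is not BigSift's implementation or API and it does not use Spark; the names `ddmin`, `mean_age`, and `fails`, and the sample records, are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(records: Sequence[T], fails: Callable[[List[T]], bool]) -> List[T]:
    """Shrink `records` to a small subset on which `fails` still returns True."""
    current = list(records)
    assert fails(current), "the full input must reproduce the failure"
    n = 2  # granularity: how many chunks the input is split into
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            if fails(subset):
                # One chunk alone reproduces the failure: keep only that chunk.
                current, n, reduced = subset, 2, True
                break
            complement = [r for j, s in enumerate(subsets) if j != i for r in s]
            if fails(complement):
                # Dropping this chunk keeps the failure: discard the chunk.
                current, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(current):
                break  # already at single-record granularity
            n = min(len(current), n * 2)
    return current


def mean_age(lines: List[str]) -> float:
    """Stand-in for the analytics program: average the age column."""
    ages = [int(line.split(",")[1]) for line in lines]
    return sum(ages) / len(ages)


def fails(lines: List[str]) -> bool:
    """Hypothetical test oracle: an average age above 120 is suspicious."""
    return mean_age(lines) > 120


records = ["alice,34", "bob,29", "carol,2500", "dave,41", "erin,38", "frank,52"]
print(ddmin(records, fails))
```

Each call to the oracle reruns the program on a candidate subset, which is why repeated reruns dominate the cost of this kind of search on large inputs.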

BigSift combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs. It redefines data provenance for the purpose of debugging using a test oracle function and implements several unique optimizations, specifically geared towards the iterative nature of automated debugging workloads. BigSift exposes an interactive web interface where a user can monitor a big data analytics job running remotely in the cloud, write a user-defined test oracle function, and then trigger the automated debugging process. BigSift also provides a set of predefined test oracle functions, which can be used for explaining common types of anomalies in big data analytics. This debugging effort is led by UCLA Professors Miryung Kim and Tyson Condie, and has produced several research papers at top software engineering and database conferences.
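The role of data provenance can be illustrated with a small sketch: if every output value remembers which input records contributed to it, a test oracle that flags a faulty output immediately narrows the search to those records. The plain-Python example below is an illustration under that assumption only. It is not BigSift's or Spark's API, and the names `average_by_key` and `is_faulty`, the temperature range, and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def average_by_key(records: List[str]) -> Dict[str, Tuple[float, List[int]]]:
    """Return {key: (average, lineage)}, where lineage lists the indices
    of the input records that contributed to that key's average."""
    values: Dict[str, List[float]] = defaultdict(list)
    lineage: Dict[str, List[int]] = defaultdict(list)
    for idx, line in enumerate(records):
        key, raw = line.split(",")
        values[key].append(float(raw))
        lineage[key].append(idx)
    return {k: (sum(v) / len(v), lineage[k]) for k, v in values.items()}


def is_faulty(avg: float) -> bool:
    """Hypothetical test oracle: average temperatures outside
    [-90, 60] degrees Celsius are treated as anomalies."""
    return not (-90.0 <= avg <= 60.0)


records = ["LA,21", "LA,24", "NYC,12", "NYC,9999", "NYC,15", "SF,17"]
result = average_by_key(records)

# Backward trace: collect only the inputs behind outputs the oracle rejects.
suspects = sorted(
    idx
    for avg, idxs in result.values()
    if is_faulty(avg)
    for idx in idxs
)
print([records[i] for i in suspects])
```

Provenance alone still returns every record behind the faulty output, including correct ones. A minimization search such as delta debugging would then only need to run over these suspects rather than the whole dataset.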

Session hashtag: #Res3SAIS



About Muhammad Ali Gulzar

I am a fourth-year Ph.D. candidate in the Programming Languages and Software Engineering (PLSE) Group at the University of California, Los Angeles. My research interests lie at the intersection of software engineering and big data systems. Specifically, I am interested in supporting interactive debugging in big data processing frameworks and providing efficient ways to perform automated fault localization in big data applications.

About Miryung Kim

Miryung Kim is an associate professor in the Department of Computer Science at UCLA as well as the cofounder of MK.Collective. Miryung builds automated software tools, such as debuggers, testing tools, refactoring engines, and code analytics, for improving data scientist productivity and efficiency in developing big data analytics. She also conducts empirical studies of professional software engineers and data scientists in the wild and uses the resulting insights to design novel software engineering tools. Previously, she was an assistant professor in the Department of Electrical and Computer Engineering at the University of Texas at Austin and a visiting researcher at the Research in Software Engineering (RiSE) group at Microsoft Research. Miryung’s honors include an NSF CAREER award, a Microsoft Software Engineering Innovation Foundation Award, an IBM Jazz Innovation Award, a Google Faculty Research Award, an Okawa Foundation Research Grant Award, and an ACM SIGSOFT Distinguished Paper Award. She also received the Korean Ministry of Education, Science, and Technology Award, the highest honor given to an undergraduate student in Korea. Miryung holds a BS in computer science from the Korea Advanced Institute of Science and Technology and an MS and PhD in computer science and engineering from the University of Washington.