Data wrangling tools let analysts build workflows to transform large and unstructured datasets into cleaned, well-structured columnar data. A key strategy for validating the cleaned data is profiling, which provides value distributions, anomaly counts, and other summary statistics per column, letting the user quickly measure quality. While invaluable, profiling must impose a minimal runtime penalty on at-scale script execution. A basic profiling approach is to summarize each column’s values, and possibly the values across pairs of columns for drill-down. Even requirements as simple as these are littered with performance challenges, including data volume, cardinality of output values, number of columns, and inclusion of non-distributive statistics (e.g., median). We discuss our experience building profiling at Trifacta. We describe our first profiling engine and then focus on our new engine, which casts profiling as an OLAP problem and leverages Spark to quickly generate query results. Its low latency enables ‘pay-as-you-go’ profiling, empowering users to explore their data iteratively, summarizing columns only as needed and executing focused drill-down queries that are too expensive to apply broadly. We see 10x-100x speedups with Spark, and faster still in pay-as-you-go cases.
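To illustrate why non-distributive statistics such as the median are singled out as a performance challenge, here is a minimal sketch in plain Python. It is not Trifacta's engine and uses no Spark; the `ColumnSummary` class, the `summarize` helper, and the two sample partitions are hypothetical names and data chosen for illustration. Counts, null counts, minimum, and maximum can be computed per partition and merged, whereas combining per-partition medians does not in general yield the median of the full column.

```python
from dataclasses import dataclass
from statistics import median
from typing import Iterable, Optional


@dataclass
class ColumnSummary:
    """Hypothetical per-partition summary holding only distributive statistics."""
    count: int = 0
    nulls: int = 0
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def add(self, value: Optional[float]) -> None:
        # Every row is counted; nulls are tallied as anomalies and skipped.
        self.count += 1
        if value is None:
            self.nulls += 1
            return
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    def merge(self, other: "ColumnSummary") -> "ColumnSummary":
        # Distributive statistics combine from partial results alone.
        mins = [m for m in (self.minimum, other.minimum) if m is not None]
        maxs = [m for m in (self.maximum, other.maximum) if m is not None]
        return ColumnSummary(
            count=self.count + other.count,
            nulls=self.nulls + other.nulls,
            minimum=min(mins) if mins else None,
            maximum=max(maxs) if maxs else None,
        )


def summarize(values: Iterable[Optional[float]]) -> ColumnSummary:
    summary = ColumnSummary()
    for value in values:
        summary.add(value)
    return summary


# Two made-up partitions of the same column.
partition_a = [1.0, 2.0, None, 100.0]
partition_b = [3.0, 4.0, 5.0]

# Merging partial summaries gives the same answer as a single full pass.
merged = summarize(partition_a).merge(summarize(partition_b))

# The median does not merge this way: the median of the per-partition
# medians differs from the median of all non-null values.
median_a = median(v for v in partition_a if v is not None)
median_b = median(partition_b)
median_of_medians = median([median_a, median_b])
true_median = median(v for v in partition_a + partition_b if v is not None)

print(merged)
print(median_of_medians, true_median)
```

In this sketch the merged summary matches a single pass over all rows, while the two median values disagree, which is why an exact median needs access to the full set of values (or an approximation technique) rather than a simple merge of partial results.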
Adam Silberstein is a director of development at Trifacta. His main area of interest is large-scale data processing, including in the batch processing and online serving spaces. His work has appeared in top database venues such as SIGMOD, VLDB, and ICDE. Prior to joining Trifacta, Adam was a Staff Software Engineer at LinkedIn and a Research Scientist at Yahoo! Research. He completed his PhD at Duke University in 2007.
Amelia Arbisser is a software engineer at Trifacta. She works on a system for profiling data in Spark, and also contributes to the job execution stack. Prior to joining Trifacta, Amelia was an engineer at Twitter, where she worked on relevance infrastructure for search and trends. She completed her master's in CS at MIT in 2012.