SparkSQL’s promise of SQL at the speed and scale of Spark DataFrames has made it the go-to interface for big data visualization tools. However, the problem of quality controlling big data threatens this promise: existing quality control tools do not scale to data sets of the size manipulated through SparkSQL. Because of this gap in the big data ecosystem, users feed unvalidated data sets into SparkSQL-backed data exploration tools, and the resulting analyses are often misleading: bad input data causes the exploration tools to produce garbage output. We will present a scalable, Spark-based framework for creating, running, and managing data quality tests so that users can trust and learn from their big data exploration tools.
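To make the idea of a data quality test concrete, here is a hypothetical sketch of the kind of checks such a framework might run. The rule names and result format are illustrative assumptions, not the actual framework's API; plain Python lists stand in for Spark data, and in a real Spark implementation each rule would compile to a DataFrame aggregation or SQL query so that it scales across the cluster.

```python
# Illustrative sketch only: named data quality rules over a data set.
# In a Spark-based framework, each rule would be evaluated as a
# distributed DataFrame aggregation rather than a Python loop.

def null_rate_check(rows, column, max_rate):
    """Fail if the fraction of null values in `column` exceeds `max_rate`."""
    nulls = sum(1 for row in rows if row.get(column) is None)
    rate = nulls / len(rows) if rows else 0.0
    return {"check": f"null_rate({column})", "passed": rate <= max_rate,
            "observed": rate}

def range_check(rows, column, lo, hi):
    """Fail if any non-null value in `column` falls outside [lo, hi]."""
    bad = sum(1 for row in rows
              if row.get(column) is not None and not (lo <= row[column] <= hi))
    return {"check": f"range({column})", "passed": bad == 0, "observed": bad}

def run_suite(rows, checks):
    """Run every check and collect results, as a test runner would."""
    return [check(rows) for check in checks]

if __name__ == "__main__":
    # A tiny sample data set with one null and one out-of-range price.
    data = [{"price": 101.5}, {"price": None},
            {"price": 99.0}, {"price": -3.0}]
    results = run_suite(data, [
        lambda r: null_rate_check(r, "price", max_rate=0.5),
        lambda r: range_check(r, "price", lo=0.0, hi=1000.0),
    ])
    for res in results:
        print(res["check"], "PASS" if res["passed"] else "FAIL")
```

Running a suite like this before a data set reaches an exploration tool is what lets users trust what the tool shows them: the null-rate check passes here, while the range check flags the negative price.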
David Durst is an Analyst on the Advanced Data Analytics team in BlackRock Solutions' Financial Modeling Group, where he develops user-friendly tools for quality controlling, understanding, and modeling large data sets. Mr. Durst graduated summa cum laude from Princeton University in 2015 with a Bachelor of Science in Engineering degree in Computer Science and a certificate in Finance.