In our experience, many problems with production workflows can be traced back to unexpected values in the input data. In a complex pipeline, tracing the root cause of such errors can be difficult and costly. Here we outline our work developing an open source data validation framework built on Apache Spark. Our goal is a tool that integrates easily into existing workflows, so that data validation automatically becomes a vital initial step of every production workflow. The tool is aimed at data scientists and data engineers, who are not necessarily Scala or Python programmers. Users specify a configuration file that details the data validation checks to be run. This configuration file is parsed into the appropriate queries, which are executed with Apache Spark. A status report is logged; it is used to notify developers and maintainers and to establish a historical record of validator checks. This work was inspired by the many great ideas behind Google’s TensorFlow Extended (TFX) platform, in particular TensorFlow Data Validation (TFDV). As such, we provide optional functionality for users to visualize their data using Facets Overview and Facets Dive.
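The configuration-to-query step described above can be illustrated with a minimal sketch. It is not the framework's actual implementation or configuration format: the schema (`table`, `checks`, `type`, `column`, `min`, `max`, `min_rows`), the check names, and the functions `check_to_query` and `build_queries` are all hypothetical, chosen only to show how declarative checks might be turned into SQL that counts violations.

```python
import json

# Hypothetical configuration: this schema is an assumption made for
# illustration, not the format used by the framework described above.
CONFIG = json.loads("""
{
  "table": "sales",
  "checks": [
    {"type": "null_check", "column": "item_id"},
    {"type": "range_check", "column": "price", "min": 0, "max": 10000},
    {"type": "row_count", "min_rows": 1}
  ]
}
""")


def check_to_query(table, check):
    """Translate one check into a SQL query whose result is a violation count.

    A result of 0 means the check passed.
    """
    kind = check["type"]
    if kind == "null_check":
        # Count rows where the column is missing.
        return (f"SELECT COUNT(*) AS violations FROM {table} "
                f"WHERE {check['column']} IS NULL")
    if kind == "range_check":
        # Count rows whose value falls outside [min, max].
        col = check["column"]
        return (f"SELECT COUNT(*) AS violations FROM {table} "
                f"WHERE {col} < {check['min']} OR {col} > {check['max']}")
    if kind == "row_count":
        # Report 1 violation if the table has fewer rows than required.
        return (f"SELECT CASE WHEN COUNT(*) < {check['min_rows']} "
                f"THEN 1 ELSE 0 END AS violations FROM {table}")
    raise ValueError(f"unknown check type: {kind}")


def build_queries(config):
    """Build one query per configured check, in configuration order."""
    return [check_to_query(config["table"], c) for c in config["checks"]]


QUERIES = build_queries(CONFIG)
```

Assuming the input data is registered as a Spark temporary view named `sales`, each generated string could then be passed to `spark.sql(...)` and the resulting violation counts written to the status report. A real tool would also need to validate column and table names before interpolating them into SQL.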
Patrick received his PhD in Mechanical Engineering from the University of Pittsburgh in 2013. His research focused on the intersection of high-performance computing and the simulation of turbulent reacting flows. In 2015 he joined Target as a data scientist, where he has worked on product and ad recommendations.
Doug develops machine learning infrastructure for Target in Pittsburgh, PA. He joined Target in 2014 and is currently a Principal Data Engineer. He has a BS in Computer Science from the University of Pittsburgh.