This presentation focuses on how we integrate human data validation into our data pipeline, using the results both to measure and track our data quality over time and to update the training sets of our supervised machine learning models.

Longer version:

Radius uses human data validation to monitor the accuracy of data at both ends of our pipeline: raw input data from our sources and prepared output data (the Radius Business Graph). Because human validation is costly in both money and time, we want to extract as much value as possible from the results. To this end, we have developed a positive feedback cycle that lets us validate our data quality regularly and efficiently while keeping the training sets for our machine learning models up to date.

In the presentation I will cover:

- The primary data problem that our engineering team is responsible for
- The previous one-way flow of data through our pipeline
- The automated framework that we built (using Spark and Databricks) to facilitate smooth human validation processes
- Challenges of using one dataset for both KPI analysis and labeled training data points
- The updated data pipeline with the positive feedback cycle
- Lessons learned, including the importance of consistent schemas and the harmonious union of data science and data engineering
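As a rough illustration of the feedback cycle described above, the sketch below shows one batch of human validation results serving two purposes: computing an accuracy KPI and refreshing a labeled training set. It is plain Python rather than Spark, and every name in it (`ValidationResult`, `accuracy`, `update_training_set`, the record fields) is hypothetical and not taken from the Radius pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationResult:
    record_id: str    # hypothetical identifier of the validated record
    features: tuple   # hypothetical model inputs for that record
    is_correct: bool  # human verdict on the value the pipeline produced


def accuracy(results):
    """Share of validated records that humans judged correct (the KPI)."""
    if not results:
        raise ValueError("no validation results")
    return sum(r.is_correct for r in results) / len(results)


def update_training_set(training_set, results):
    """Return a new training set holding the latest human label per record."""
    updated = dict(training_set)
    for r in results:
        updated[r.record_id] = (r.features, r.is_correct)
    return updated


batch = [
    ValidationResult("a", (0.9, 1.0), True),
    ValidationResult("b", (0.2, 0.0), False),
    ValidationResult("c", (0.7, 1.0), True),
    ValidationResult("d", (0.4, 1.0), True),
]
training_set = {"a": ((0.9, 1.0), False)}  # stale label, replaced below

kpi = accuracy(batch)
training_set = update_training_set(training_set, batch)
print(f"accuracy={kpi:.2f}, training examples={len(training_set)}")
```

Keeping one record type for both uses is one way to reflect the point about consistent schemas: the same validated batch feeds the quality metric and the training data without any conversion step.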
Dan Morris is Sr. Director, Data Platform at Viacom. In his current role, Dan is responsible for democratizing access to data with self-service capabilities and reducing time to insight with real-time analytics. Prior to this role, Dan focused on growing Viacom’s global digital audience with Product Analytics. Dan holds a master's degree from NYU and is currently pursuing his second degree at Northwestern University.