Dan Morris is Senior Director, Data Platform at Viacom. In his current role, Dan is responsible for democratizing access to data through self-service capabilities and reducing time to insight with real-time analytics. Prior to this role, Dan focused on growing Viacom’s global digital audience through product analytics. Dan holds a master’s degree from NYU and is currently pursuing a second degree at Northwestern University.
Viacom, the global media company and home to brands including MTV, Comedy Central, and Nickelodeon, explains how it is using Apache Spark and Databricks to adapt quickly to its audience: building a just-in-time data warehouse that supports its aggressive, data-driven campaign to roll out new apps around the globe.
The presentation will focus on how we integrate human data validation into our data pipeline, using the results both to measure and track our data quality over time and to update the training sets of our supervised machine learning models.

Longer version: Radius uses human data validation to monitor the accuracy of data at both ends of our pipeline: raw input data from our sources and prepared output data (the Radius Business Graph). Because human validation is costly in both money and time, we want to extract as much value as possible from the results. To this end, we have developed a positive feedback cycle that allows us to regularly and efficiently validate our data quality while maintaining up-to-date training sets for our machine learning models. In the presentation I will cover:

- The primary data problem that our engineering team is responsible for
- The previous one-way flow of data through our pipeline
- The automated framework that we built (using Spark and Databricks) to facilitate smooth human validation processes
- Challenges of using one dataset for both KPI analysis and labeled training data points
- The updated data pipeline with its positive feedback cycle
- Lessons learned, including the importance of consistent schemas and the harmonious union of data science and data engineering
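The dual use of validation results described above can be sketched in miniature: a single batch of human-reviewed records feeds both a data-quality KPI and an updated training set. This is an illustrative sketch only, not Radius's actual pipeline; every name (`ValidationResult`, `accuracy_kpi`, `update_training_set`, the label values) is a hypothetical stand-in, and the real system runs on Spark rather than in-memory Python.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ValidationResult:
    # One human-reviewed record: the pipeline's prediction vs. the human label.
    record_id: str
    predicted_label: str
    human_label: str  # ground truth supplied by a human validator

def accuracy_kpi(results):
    """Fraction of records where the pipeline's prediction matched the human label."""
    if not results:
        return 0.0
    correct = sum(1 for r in results if r.predicted_label == r.human_label)
    return correct / len(results)

def update_training_set(training_set, results):
    """Fold human-labeled records back into the training set, keyed by record_id
    so a fresh human label overrides any stale one (consistent schemas matter here)."""
    updated = dict(training_set)
    for r in results:
        updated[r.record_id] = r.human_label
    return updated

# One validation batch serves both consumers at once:
batch = [
    ValidationResult("biz-1", "restaurant", "restaurant"),
    ValidationResult("biz-2", "retail", "restaurant"),  # model was wrong; human corrects it
]
kpi = accuracy_kpi(batch)                                   # tracked over time as a quality metric
training = update_training_set({"biz-1": "retail"}, batch)  # training data stays current
```

The key design point is that the same costly human review is consumed twice, which is what makes the feedback cycle pay for itself; one complication the talk flags is that a dataset sampled for KPI measurement is not automatically a well-balanced labeled training set.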