Jim Forsythe leads the Product Analytics & Behavior Science (PABS) team for the Technology, Product and Xperience organization at Comcast, where he is responsible for transforming bits of data into consumable, productive insights. Jim’s day-to-day work involves building data pipelines, researching new ideas, developing key metrics and informing data-driven decision making. Prior to Comcast, Jim led data science teams for a Fortune 500 management consulting firm, where he specialized in large-scale product analytics, cloud platforms, user behavior research and retention modeling for new product initiatives.
Comcast is the largest cable and internet provider in the US, reaching more than 30 million customers, and continues to grow its presence in the EU with the acquisition of Sky. Over the last couple of years, Comcast has shifted its focus to the customer experience. For example, Comcast has rolled out our Flex device, which allows customers to stream content directly to their TVs without needing an additional cable subscription. As part of this shift, Comcast has made a concerted effort to make data-driven decisions, using data to understand how customers interact with our products while continuing to innovate with new products and subscriptions. The Product Analytics & Behavior Science (PABS) team plays a crucial role as an interpreter, transforming data into consumable insights and providing these insights to the broader product teams within Comcast. The PABS team does this across the entire product ecosystem, including X1, xFi and our brand new Flex devices. This ecosystem is one of the largest streaming platforms in the world and generates data at a rate of more than 25TBs per day, with over 3PBs of data being used for consumable insights. To keep delivering insights on these massive data sets while still controlling the amount of data being stored, the PABS team has been using Databricks and Databricks Delta Lake for high-concurrency, low-latency reads and writes, both to build reliable real-time data pipelines and to perform efficient deletes in a timely manner. In this talk we cover the Delta features we took advantage of to achieve the desired levels of efficiency, optimization and cost savings.
"Comcast is the largest cable and internet provider in the US, reaching more than 30 million customers. Over the last couple of years, Comcast has transformed the customer experience using machine learning. For example, Comcast uses machine learning to power the X1 voice remote, which our customers used over 8B times in 2018 to find something they love to watch, get the latest sports statistics, control their home, or check their bill and troubleshoot their service using natural language. What all these applications have in common is that, to create and operate the machine learning models powering them, we need to ingest many TBs of data on a daily basis in an efficient and resilient manner. We also need a machine learning platform that allows for fast exploration of new ideas while automatically deploying the resulting machine learning models into a production environment that can handle Comcast scale. In this talk we describe our data and machine learning infrastructure built on the Databricks Unified Analytics Platform, including how Databricks Delta is used for the ingest and initial processing of the raw telemetry from our video and voice applications and devices. We then explain how this data can be used both by the product organizations, to gain deeper insights into how our products are being used, and by our research and engineering teams, to train and fuel the machine learning models at the heart of these products. This keynote will also include an end-to-end demonstration of our machine learning platform, which is centered around Databricks and MLflow, and how it integrates with other open source machine learning frameworks such as TensorFlow, PyTorch, scikit-learn, H2O and Kubeflow, to name a few."
Comcast has made a concerted effort to transform itself from a cable/ISP company into a technology company. Data-driven decision making is at the heart of this transformation: we use data to understand how customers interact with our products, and we see data as the most truthful representation of the voice of our customer. My team, the Product Analytics & Behavior Science (PABS) team, plays the role of interpreter, transforming data into consumable insights. The X1 entertainment operating system is one of the largest video streaming platforms in the world, and our customers consume more than a billion hours of content a week on X1. Our team consumes X1 telemetry at a rate of more than 25TBs of data per day and uses this data to inform our product team members about the performance of, and engagement with, the platform. We also use this data to research customer behaviors and point our product team members to areas of opportunity in our products, which range from fixing bugs to creating new features. To power these insights, we need reliable real-time data pipelines, and we need our data scientists and data engineers to be able to develop and commit new code quickly and efficiently so that we can measure the new features the product teams are developing. To do this at this scale, we have been using Databricks and Databricks Delta to gain operational efficiencies, optimization and cost savings. Some of the features from Delta that we took advantage of to achieve the desired levels of efficiency, optimization and cost savings are:
· Distributed writes to S3 (essentially eliminating 500 errors)
· S3 log with fast reads and ACID transactions (massive increases in S3 scans/reads, and enabling consistent views of the bucket/table)
· Vacuum
· Optimize (which has allowed us to reduce a 640 node job to 40, and massively increase the efficiency of our clusters as well as of our data scientists and data engineers)
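To make the last three features more concrete, here is a minimal sketch of how a transactional delete, Optimize and Vacuum might be chained on one table. It is not the PABS team's pipeline. It assumes the open source delta-spark Python API (`DeltaTable.delete`, `DeltaTable.optimize().executeCompaction()` and `DeltaTable.vacuum`); the function names, the table path and the predicate are hypothetical, and the 168-hour retention is simply Delta's default of 7 days.

```python
from typing import Any


def purge_and_compact(table: Any, predicate: str,
                      retention_hours: float = 168.0) -> None:
    """Delete matching rows, compact small files, then vacuum old files.

    `table` is assumed to behave like delta.tables.DeltaTable.
    """
    # 1. Transactional delete: rows matching the SQL predicate disappear from
    #    the table's current version, but the old data files stay on storage.
    table.delete(predicate)
    # 2. Optimize: rewrite many small files into fewer, larger ones so that
    #    later scans have fewer objects to open.
    table.optimize().executeCompaction()
    # 3. Vacuum: physically remove files that the table no longer references
    #    and that are older than the retention window.
    table.vacuum(retention_hours)


def run_on_delta_table(spark: Any, path: str, predicate: str) -> None:
    """Hypothetical entry point; needs pyspark and delta-spark installed."""
    from delta.tables import DeltaTable  # imported lazily on purpose

    purge_and_compact(DeltaTable.forPath(spark, path), predicate)
```

On a cluster, a call might look like `run_on_delta_table(spark, "s3://example-bucket/x1_telemetry", "event_date < '2019-01-01'")`, where the bucket name and the date column are placeholders. Running Vacuum last matters for storage control: until it runs, the files made obsolete by the delete and by the compaction are still kept for time travel.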