As a global technology and media company connecting millions of customers to personalized experiences, Comcast struggled with massive data, fragile data pipelines, and poor data science collaboration. With Databricks, including Delta Lake and MLflow, Comcast can build performant data pipelines for petabytes of data and easily manage the lifecycle of hundreds of models to create a highly innovative, unique, and award-winning viewer experience using voice recognition and machine learning.
Infrastructure unable to support data and ML needs
Instantly answering a customer’s voice request for a particular program while turning billions of individual interactions into actionable insights strained Comcast’s IT infrastructure and its data analytics and data science teams. To make matters more complicated, Comcast needed to deploy models to a disjointed and disparate range of environments: cloud, on-prem, and even directly to devices in some instances.
- Massive data: billions of events generated by Comcast’s entertainment system and 20+ million voice remotes, resulting in petabytes of data that need to be sessionized for analysis.
- Fragile pipelines: complicated data pipelines that frequently failed and were hard to recover. Small files were difficult to manage, slowing data ingestion for downstream machine learning.
- Poor collaboration: globally dispersed data scientists working in different scripting languages struggled to share and reuse code.
- Management of ML models: developing, training, and deploying hundreds of models was highly manual, slow, and hard to replicate, making it difficult to scale.
- Friction between dev and deployment: dev teams wanted to use the latest tools and models while ops wanted to deploy on proven infrastructure.
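To make "sessionized" concrete, here is a minimal sketch of gap-based sessionization in plain Python. It is an illustration only, not Comcast's pipeline: the 30-minute inactivity threshold, the `(device_id, timestamp)` event shape, and the `remote-a`/`remote-b` identifiers are all assumptions, since the source does not describe how sessions are defined.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: 30 minutes of inactivity ends a session.
# The case study does not state the actual threshold.
SESSION_GAP = timedelta(minutes=30)


def sessionize(events, gap=SESSION_GAP):
    """Group (device_id, timestamp) events into per-device sessions.

    Returns a dict mapping device_id to a list of sessions; each
    session is a list of timestamps in ascending order.
    """
    by_device = defaultdict(list)
    for device_id, ts in events:
        by_device[device_id].append(ts)

    sessions = {}
    for device_id, stamps in by_device.items():
        stamps.sort()
        device_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            # Start a new session when the pause since the previous
            # event exceeds the allowed gap.
            if ts - device_sessions[-1][-1] > gap:
                device_sessions.append([ts])
            else:
                device_sessions[-1].append(ts)
        sessions[device_id] = device_sessions
    return sessions


# Hypothetical events from two voice remotes.
events = [
    ("remote-a", datetime(2020, 1, 1, 20, 0)),
    ("remote-a", datetime(2020, 1, 1, 20, 10)),
    ("remote-a", datetime(2020, 1, 1, 21, 30)),
    ("remote-b", datetime(2020, 1, 1, 20, 5)),
]
result = sessionize(events)
```

At petabyte scale the same idea would typically be expressed as a distributed job partitioned by device rather than as an in-memory loop.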
Automated infrastructure, faster data pipelines with Delta Lake
Comcast realized they needed to modernize their entire approach to analytics from data ingest to the deployment of machine learning models that deliver new features that delight their customers. Today, the Databricks Unified Data Analytics Platform enables Comcast to build rich data sets and optimize machine learning at scale, streamline workflows across teams, foster collaboration, reduce infrastructure complexity, and deliver superior customer experiences.
- Simplified infrastructure management: reduced operational costs through automated cluster management and cost management features such as autoscaling and spot instances.
- Performant data pipelines with Delta Lake: Delta Lake is used for the ingest, data enrichment, and initial processing of the raw telemetry from video and voice applications and devices.
- Reliably manage small files: Delta Lake enabled them to optimize files for rapid and reliable ingestion at scale.
- Collaborative workspaces: interactive notebooks improve cross-team collaboration and data science creativity, allowing Comcast to greatly accelerate model prototyping for faster iteration.
- Simplified ML lifecycle: managed MLflow simplifies the machine learning lifecycle and model serving via the Kubeflow environment, allowing them to track and manage 100s of models with ease.
- Reliable ETL at scale: Delta Lake provides efficient analytics pipelines at scale that can reliably join historic and streaming data for richer insights.
- Comcast also serves data to analysts using Tableau, providing a broader set of data for customer analysis at high velocity.
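The small-file problem mentioned above comes down to combining many tiny files into fewer, larger ones. The sketch below shows one simple way to plan such a compaction, using greedy grouping up to a target size. It is a conceptual illustration in plain Python, not Delta Lake's implementation; the function name `plan_compaction`, the file names, and the 32 MB target are hypothetical.

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group files into batches of at most target_bytes.

    file_sizes maps a file name to its size in bytes. A file larger
    than target_bytes ends up in a batch of its own.
    """
    batches = []
    current, current_size = [], 0
    # Largest files first, so big files are placed before small ones.
    ordered = sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in ordered:
        # Close the current batch if adding this file would overflow it.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


MB = 1024 * 1024
# Hypothetical input: ten 8 MB files to be rewritten as ~32 MB files.
small_files = {f"part-{i:04d}.parquet": 8 * MB for i in range(10)}
plan = plan_compaction(small_files, target_bytes=32 * MB)
```

Each batch in `plan` lists the files that would be rewritten together as one larger file, so downstream readers open three files here instead of ten.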
Delivering personalized experiences with ML
In the intensely competitive entertainment industry, there is no time to press the pause button. Armed with a unified approach to analytics, Comcast can now fast forward into the future of AI-powered entertainment – keeping viewers engaged and delighted with competition-beating customer experiences.
- Emmy-winning viewer experience: Databricks helps Comcast create a highly innovative and award-winning viewer experience with intelligent voice commands that boost engagement.
- Reduced compute costs by 10X: Delta Lake has enabled Comcast to optimize data ingestion, replacing 640 machines with 64 while improving performance. Teams can spend more time on analytics and less time on infrastructure management.
- Less DevOps: reduced the number of DevOps full-time employees required to onboard 200 users from 5 to 0.5.
- Higher data science productivity: Fostered collaboration between global data scientists by enabling different programming languages through a single interactive workspace. Also, Delta Lake has enabled the data team to use data at any point within the data pipeline, allowing them to act more quickly in building and training new models.
- Faster model deployment: reduced deployment times from weeks to minutes as operations teams deployed models on disparate platforms.