customer story
Migrating to the cloud to enable the world’s largest digital library

Scribd moves from Hadoop to Databricks to lower costs, increase scale and unify its data team

INDUSTRY: Media and Entertainment

SOLUTION: Personalization Recommendation Engine

TECHNICAL USE CASE: Data Ingest and ETL, Machine Learning, Data Analytics

Scribd is on a mission to change the way the world reads. With over 60 million titles in its online library, it’s focused on leveraging data and AI to uncover interesting ways that get people excited about reading. Challenged with a legacy Hadoop infrastructure that was too rigid and couldn’t scale to meet their real-time needs, Scribd switched to Databricks on AWS and Delta Lake for its performance, elasticity, and ease of use. This migration to the cloud has eliminated infrastructure complexity, allowing their data team to operate with agility, build fast and reliable data pipelines, and easily collaborate on models that deliver an engaging experience to their customers.

Building a scalable infrastructure for reading without limits

Scribd has taken on the task of getting people excited to read. With millions of titles available on their online platform, the business challenge they are trying to address with machine learning is to encourage reading without limits by serving up the right content to the right people.

The key to building an effective recommendation engine is data, but the Scribd team could not process their massive batch and streaming datasets quickly enough for downstream analytics and machine learning. Hampered by a rigid on-premises Hadoop infrastructure, they struggled with performance and maintenance at scale. Adding to the problem, small files made up over 70% of their data, contributing to performance gaps and high operational costs.

Once the data did make it downstream, silos further hampered data team productivity. Data engineering wasn’t able to effectively support the data scientists, and the inability to share and reuse code and models slowed machine learning innovation.

“Over time the needs of the business have changed,” explained R Tyler Croy, Scribd’s Director of Platform Engineering. “We now need more machine learning, more real-time data processing, and more support for teams collaborating to deliver new data products; we needed something better than what we had in place.”

A unified platform that’s lightning fast, collaborative, easy to use

The first step the Scribd team took was to move to the cloud (AWS) to take advantage of its elastic infrastructure and range of developer tools. They turned to Databricks as their unified data analytics platform to simplify the management of their data analytics workflows, significantly improving development velocity and cross-team collaboration.

With an elastic infrastructure that features optimized Spark clusters, Delta Lake with Delta Caching, and a decoupling of compute from storage, the small file problem that plagued the engineering team was a thing of the past. Delta Lake is key to empowering them to build performance-optimized data pipelines that support both historical and streaming data with ease.

“Delta Lake has unified our streaming and batch applications, allowing us to deliver fresh data faster. We are also able to stream data into S3 without any performance or consistency issues which has solved our small file problem,” expressed Croy.
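To make the small file problem concrete, here is a minimal conceptual sketch in Python of compaction, the general idea of packing many small files into fewer files near a target size. It is an illustration only and not Delta Lake's or Scribd's implementation; the function name `plan_compaction`, the 128 MB target, and the sample file sizes are all assumptions chosen for the example.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[float], target_mb: float = 128.0) -> List[List[float]]:
    """Group files into bins of at most target_mb using first-fit decreasing.

    Hypothetical helper for illustration; real engines use their own strategies.
    """
    bins: List[List[float]] = []   # files assigned to each output file
    totals: List[float] = []       # running size of each output file
    for size in sorted(file_sizes_mb, reverse=True):
        for i, total in enumerate(totals):
            if total + size <= target_mb:
                bins[i].append(size)
                totals[i] += size
                break
        else:
            # No existing bin has room, so start a new output file.
            bins.append([size])
            totals.append(size)
    return bins


# Assumed sample data: mostly small files, as in the story's "over 70%" figure.
small_files = [4.0] * 50 + [16.0] * 10 + [120.0]
plan = plan_compaction(small_files)
print(f"{len(small_files)} input files -> {len(plan)} compacted files")
```

Fewer, larger files mean fewer objects to list and open per query, which is one reason heavy small-file workloads tend to be slow and costly.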

Another benefit of Delta Lake has been a consistent view of the data across the organization. Data can stream as cascaded tables, allowing all their workloads to be easily accessed and consumed from a single table. “Delta Lake allows our users to look up current information or go back in time,” said Croy. “As a result, our data customers now have one place to go for their needs which has put the power of our data in their hands.”

With data flowing to the analytics and data science teams, those teams can now collaborate and share code via interactive notebooks, greatly accelerating development. “Databricks’ interactive notebooks proved to be such a killer feature for our developers and analysts, who had, to date, been collaborating by sharing code via copy and paste,” stated Croy.

Another component of the Databricks platform that has proven valuable is MLflow, which allows the data science team to streamline the machine learning lifecycle. What used to be a highly manual and error-prone process is now automated and time-saving. “We used to waste so much time with manual handoffs of code and models. With MLflow, we are able to work on model training and versioning while the engineering team is deploying and serving the model,” explained Amir Hajian, Director of Applied Research and Data Science at Scribd.

With a scalable and performant platform that has simplified infrastructure to better support data analytics and machine learning workflows, data teams at Scribd are well equipped to take their products to the next level with a more engaging experience that drives customer lifetime value and retention.

Reaping the benefits of a unified approach in the cloud

The move to Databricks on AWS from their legacy Hadoop infrastructure has been a game-changer. Infrastructure management and data engineering have been greatly simplified and optimized for performance. With query execution powered by Databricks Runtime, Scribd experienced an optimization of 30-50% for most traditional Spark workloads. “At a 17% optimization rate, Databricks would reduce our AWS infrastructure cost so much that it would pay for the cost of the Databricks platform itself,” stated Croy.
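The break-even reasoning in Croy's quote can be worked through with a little arithmetic. The sketch below assumes that the "optimization rate" translates directly into the same percentage reduction in the AWS bill, and that the Databricks platform cost equals 17% of that baseline bill (the break-even point he cites). Both are simplifying assumptions for illustration, and the function name `net_saving_fraction` is hypothetical.

```python
def net_saving_fraction(optimization_rate: float, break_even_rate: float = 0.17) -> float:
    """Net saving as a fraction of the baseline AWS bill.

    Assumes AWS cost falls linearly with the optimization rate and that
    the platform cost equals break_even_rate times the baseline bill.
    """
    return optimization_rate - break_even_rate


# Break-even point and the two ends of the 30-50% range from the story.
for rate in (0.17, 0.30, 0.50):
    saving = net_saving_fraction(rate)
    print(f"{rate:.0%} optimization -> net saving of {saving:.0%} of the baseline AWS bill")
```

Under these assumptions, the reported 30-50% range would leave a net saving of roughly 13-33% of the baseline bill after covering the platform cost.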

Looking ahead, the data team at Scribd wants to expand support for more data streams, which will change how they access and work with the data and ultimately let them answer more questions, resulting in better and more personalized experiences for their customers.

  • 30-50% reduction in operational costs

“Migrating from an on-premises infrastructure to Databricks on AWS was key to unlocking the possibilities of our data and enabling our data team to thrive.”

– R Tyler Croy, Director of Platform Engineering at Scribd

Related Content

Blog: Accelerating developers by ditching the data center
Technical Talk at Spark + AI Summit 2020