Scribd uses Delta Lake to enable the world’s largest digital library. Watch this discussion with QP Hou, Senior Engineer at Scribd and an Airflow committer, and R Tyler Croy, Director of Platform Engineering at Scribd, to learn how they transitioned from legacy on-premises infrastructure to AWS and how they implemented and optimized Delta tables and the Delta transaction log. Please note that this session ran live in October; below are the questions and answers that were raised at the end of the meetup.
Watch the discussion
Questions and answers below have been slightly modified for brevity; you can listen to the entire conversation in the video above.
How do you optimize and manage the file sizes in your cloud storage? For example, when you have a lot of files going into your S3 buckets, that can potentially increase the costs, right? So how do you optimize all this? How do you improve the performance?
So one of the big reasons that we chose Delta Lake was that we wanted to use it for streaming tables to work with our streaming workloads. As you can imagine, when you are writing from a streaming application, you are basically creating a lot of small files. All of these small files will cause a big performance issue for you. Luckily, Delta Lake comes with the OPTIMIZE command, which you can use to compact those small files into larger ones. From a user’s point of view, it transparently speeds up queries. You just have to run the OPTIMIZE command on the table and the compaction is taken care of for you.
From the writer’s point of view, they don’t really care about optimization. The clients just write whatever data they want to the table, and you can do concurrent writes as well. While the readers do have to care about the small file problem, the writers do not. And running OPTIMIZE is safe because Delta Lake itself has MVCC (multi-version concurrency control), so you can optimize a table and write into it concurrently.
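To make the compaction idea above concrete, here is a minimal, self-contained Python sketch. It is a toy model, not the Delta Lake API or its actual implementation: the `ToyTable` class, the file names, and the sizes are all hypothetical. It only illustrates the general shape of the technique, namely that compaction is recorded as one more commit in an append-only log, so a reader pinned to an older version and a writer appending new files are both unaffected.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy table with an append-only commit log (hypothetical, not Delta Lake)."""
    log: list = field(default_factory=list)

    def commit(self, add=None, remove=None):
        # Each commit records files added ({name: size}) and files removed.
        self.log.append({"add": dict(add or {}), "remove": set(remove or ())})
        return len(self.log) - 1  # version number of this commit

    def snapshot(self, version=None):
        # Replay the log up to `version` to find the files visible at that version.
        if version is None:
            version = len(self.log) - 1
        files = {}
        for entry in self.log[: version + 1]:
            for name in entry["remove"]:
                files.pop(name, None)
            files.update(entry["add"])
        return files

    def compact(self, target_size):
        # Rewrite files smaller than target_size into fewer, larger files,
        # recorded as a single commit that removes the old files and adds the new.
        small = {n: s for n, s in self.snapshot().items() if s < target_size}
        if len(small) < 2:
            return None
        version = len(self.log)
        add, pending, part = {}, 0, 0
        for size in small.values():
            pending += size
            if pending >= target_size:
                add[f"compacted-v{version}-{part}"] = pending
                pending, part = 0, part + 1
        if pending:
            add[f"compacted-v{version}-{part}"] = pending
        return self.commit(add=add, remove=small)


table = ToyTable()
for i in range(10):                       # a streaming writer: 10 small files
    table.commit(add={f"part-{i}": 10})
reader_version = len(table.log) - 1       # a reader pins the current version

optimize_version = table.compact(target_size=40)
table.commit(add={"part-10": 10})         # the writer keeps appending afterwards

print(len(table.snapshot(reader_version)))    # files seen by the pinned reader
print(len(table.snapshot(optimize_version)))  # files right after compaction
print(len(table.snapshot()))                  # files at the latest version
```

In this toy run the pinned reader still sees the ten original small files, the compacted version holds three larger files with the same total size, and the latest version adds the writer's new file on top. In a real Delta table the equivalent step is the OPTIMIZE command described above.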
How has streaming unlocked value for your data workloads, and how have your users responded to this type of architecture?
When I started doing streaming at Scribd, real-time data processing was that pie-in-the-sky, moonshot sort of initiative in comparison to the way that most of our data customers had traditionally consumed data.
They were used to nightly runs, such that if anything went wrong, they might get their data two days from now. But what if they wanted to look at A/B test results for a deployment that went out at 9:00 AM today? Using the traditional batch flow, they would be waiting until tomorrow morning or, worst case, until Saturday morning. With streaming, the goal is to analyze data as soon as it is created and to get it to the people who want to use it.
And there’s a couple of really interesting use cases that started to come out of the woodwork once we started incorporating streaming more into the platform – with one big one that was totally unexpected around our ad hoc queries. For starters, we enabled all of these people to use Databricks notebooks to run these queries. Because we’re streaming data to a Delta Table, from the user’s perspective it just looks like any other table. If you wanna pull streaming data into your ad hoc workloads and you don’t have Delta you might be teaching users how to connect to Kafka topics or pulling it into some other intermediate store that they’re going to query. But for our users, it’s simply a Delta Table that is populated by stream versus a Delta Table that’s updated via the nightly batch. It’s fundamentally the same interface except one is obviously refreshed a lot more frequently. And so users, in a lot of cases without even realizing it, started to get faster results because their tables were actually being streamed into as opposed to written from a nightly batch.
This was when some people started to recognize that they had gained a superpower, and they were over the moon. I think the fastest time from data generated to something available in the platform that I’ve seen is about nine seconds. That’s nine seconds from the event being created in a production web application to it being available in a Databricks notebook. When you show somebody who’s used to a worst-case scenario of 48 hours for their data that it’s now down to nine seconds – it’s like showing a spaceship to someone from the 1700s. They almost can’t comprehend the tremendous amount of change that they just encountered and now benefit from.
How has Databricks helped your engineering team deliver?
The biggest benefit we get is a productivity boost; nowadays I think everyone agrees that engineering time is way more expensive than whatever other resources you will be buying. So being able to save developer time is the biggest win for us.
The other thing is being able to leverage the latest technologies that are standard in the industry, such as the latest version of Apache Spark™, and I have to say that Databricks has done a really good job at optimizing Spark. Not all of those optimizations are available in open source, so when we’re using the Databricks platform we get all of the optimizations that we need to get the job done a lot faster.
Back in the old days, engineers had to compete for development machines. This is no longer the case, as we can now collaborate on notebooks – this is a huge win! By being able to run your development workflows in the cloud, you can scale to whatever kind of machines you need to get your work done. If you need the work to be completed faster, you just add more machines and it gets done faster! I have to reiterate that all of the engineers really love the notebook interface that Databricks provides. I think that was also one of the main reasons that we chose Databricks from the beginning – we really loved the collaborative experience.
Can you tell us a little about what you are working on to help Scribd make it easier for readers to consume the written word?
New recommendations are probably one of the most important parts of our future; what originally really attracted me to Scribd as a company is that the business relies on the data platform. The future success of Scribd is really, really intertwined with how well we can build out, scale, and mature our recommendations engines, our search models, and our ability to process content and get it back to users in a way that they are going to find compelling and interesting. Because data is at the core of our content (the audiobooks, books, documents, etc.), it is core to what makes Scribd valuable and what makes Scribd successful. There’s a very short line here: if we make a better data platform, if I can enable that recommendations engineer to do better work, that’s immediately more success for the company. And so for us, recommendations and search are crucial to the business, and the fact that our work on the data platform directly impacts that key functionality is really, really exciting, but it also means that we’ve got to do things right!
Just to cycle back a little bit to the technical side of things, I want to mention how Delta Lake enabled us to build better recommendation systems. As Tyler mentioned earlier, we have these daily batch pipelines that run every day. And as you can imagine, if a user clicks on something or expresses an intent that they liked this type of content, what if they get no new recommendations after that? That’s not a good user experience.
With Delta Lake, we now stream that user intent into our data system and into our machine learning pipelines. Now we can react to user requests in real time or near real time, thus providing much better and fresher recommendations to our users. I think this is proof that having the right technology unlocks all these possibilities for engineering teams and prompts them to build products that were not even possible before. So I think that was a big thing we got from using Delta Lake as well.
Watch the discussion here: https://youtu.be/QF180xOo0Gc
Learn more about how Scribd switched to Databricks on AWS and Delta Lake: https://databricks.com/customers/scribd