Scaling Online ML Predictions At DoorDash

May 26, 2021 11:30 AM (PT)


DoorDash is a 3-sided marketplace that consists of Merchants, Consumers, and Dashers.
As DoorDash's business grows, the online ML prediction volume grows exponentially to support the various machine learning use cases, such as ETA predictions, Dasher assignments, personalized restaurant and menu item recommendations, and the ranking of a large volume of search queries.

The prediction service built to meet the above use cases now supports many dozens of models spanning different machine learning approaches such as gradient boosting, neural networks, and rule-based models. The service serves more than 10 billion predictions every day, with a peak rate above 1 million predictions per second.

In this session, we will share our journey of building and scaling our machine learning platform and particularly the prediction service: the various optimizations we experimented with, the lessons learned, and the technical decisions and tradeoffs made. We will also share how we measure success and how we set goals for the future. Finally, we will end by highlighting the challenges ahead of us in extending our machine learning platform to support the data scientist community and a wider set of use cases at DoorDash.

In this session watch:
Hien Luu, Sr. Engineering Manager, DoorDash

 

Transcript

Hien Luu: Hi, everyone. Welcome to the Scaling ML Predictions at DoorDash session. My name is Hien Luu and I lead the ML platform team. In this presentation, I will present the first part, and Arbaz, an engineer on my team, will present the second part.
First, I will share a little bit about the DoorDash marketplace, then briefly talk about a few interesting machine learning use cases, and discuss some details about the approach and the strategies we use for building our ML platform. Arbaz will talk about the main part of the presentation, which is scaling ML online predictions. Then, we'll wrap up with a few lessons learned and future work.
DoorDash's mission is to grow and empower local economies, and we do that through a set of product offerings. First, delivery and pickup, where customers like you and me can order food on demand for breakfast, lunch, or dinner, or for various parties and similar occasions. Second, an emerging product, convenience and grocery, where customers can order non-food items such as flowers, charcoal for the barbecue, or even sugar for baking cookies. Third, DashPass, a subscription product that is similar to Amazon Prime. DoorDash also provides a logistics platform that powers deliveries for some of the notable merchants that you see in the slide, like Chipotle, Walmart, or Target.
Now, the DoorDash platform is a three-sided marketplace, and like most marketplaces, it has a flywheel effect, and I'll share some details about that. Starting at the center and moving to the right: as we bring on more customers there will be more orders, and that creates more earnings for Dashers, which improves the delivery speed and efficiency of the marketplace. Starting at the center and moving to the left: again, as we have more customers, there will be more revenue for the merchants, and the more merchants we have on the marketplace, the more selection there is for customers like you and me. As this flywheel speeds up, it brings benefits to all three sides. It drives more growth for merchants, creates greater earnings for Dashers, and brings convenience and selection for all of us as customers.
Now, let's talk about the machine learning use cases at DoorDash. I would describe these use cases by going through the food ordering steps, which consist of creating the order, order checkout, dispatching the order, and delivering the order. Each of these steps has its own interesting machine learning use cases. Let's start with creating the order. This is where customers like you and me land on the homepage and decide what to order for breakfast, lunch, or dinner. As far as recommendations go, we would like to provide a personalized food ordering experience for customers by surfacing the best stores for them. For search and ranking, when a customer searches for something like Italian food, we want to show the most relevant results and rank them with respect to distance and quality. We don't want to show restaurants that are 40 or 50 miles away.
At the bottom, you see promotions from particular merchants. For merchants, we would like to help them maximize their sales by providing targeted promotions of certain items at certain times of the day. On to the order checkout. At the top, based on the items in the shopping cart, we make recommendations about side dishes that often go together. For example, if you order hamburgers and forgot to order fries, we'll make sure to recommend that. At the bottom, this relates to fraud use cases. Like most internet e-commerce companies, we need to deal with financial transactions and credit card related issues. For the dispatching step, this is where the logistics engine kicks in. It sends the order to the merchant and then assigns the order to a Dasher. The engine's primary job is to dispatch the right Dashers at the right time for the right deliveries to maximize marketplace efficiency.
In terms of the food preparation time and travel time estimate predictions, they have a few interesting nuances such as traffic and weather, which we don't have much control over. Another interesting challenge in this area is that, unlike Amazon orders, food orders emerge on the fly and food quality plays an important role in the delivery; an extra five minutes could be the difference between a great meal and an awful meal. For the last step, delivering the order: regarding Dashers, the first question is how best to spend the right amount of money to bring new Dashers onto the marketplace. Once they're onboarded, how do we mobilize them to come out on certain nights of the week, or position Dashers at the right locations to satisfy the demand on the marketplace? In terms of delivery assignment, we leverage machine learning to come up with an optimal matching of Dashers to deliveries, with these goals in mind: ensure Dashers get more done in less time and consumers receive their orders quickly.
The last use case is that once the food is delivered to the customer's doorstep, we would like to verify the accuracy of that delivery. Now, let's move on and talk about our machine learning platform journey. Our journey started about a year ago when DoorDash decided to invest in building a centralized platform. As usual, different ML teams have slightly different needs, so it is very important to establish buy-in from those teams on the idea of a centralized platform. One of the ways to do that is to have representation from each of those teams, as well as an open forum to voice the collective needs and opinions, and to be open to contributions to the development of the platform. The collaboration model we established includes a machine learning council, which is responsible for making cross-functional team decisions such as build versus buy. It also acts as the glue that holds these different teams together so we can collaborate closely and learn from each other. We don't claim this approach is the only right one, but it is the one that has worked for us so far.
These are the pillars of our platform. They are modeled after the ML [inaudible] process, starting with feature engineering and going all the way to model predictions. The approach we use in building these pillars starts with thinking big, by working backward from the customers to put together a product vision document, and then building these pieces out incrementally based on the needs of the community while keeping an eye on the long-term vision. Now, I will share some technical details about some of the pillars.
Starting with feature engineering: there are historical and real-time features. The historical features are generated from the Snowflake data warehouse and the data lake, and the feature service is responsible for uploading billions of features into the feature store. For the real-time features, we built a framework on top of Apache Flink to generate this type of feature, and they are also written to the feature store. For model training and management, while in the exploratory stage, data scientists have the flexibility to train and iterate on their models. But before deploying those models to production, the training scripts need to be checked in, where our model training service picks them up, trains the model, and finally stores those models in a model store. We leverage the compute resources from the Databricks hosted platform to train our models. We also leverage MLflow for things such as tracking experiment runs, training results, compute resources, library versions, and so on.
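To make the historical-feature upload path concrete, here is a minimal Kotlin sketch of batch-writing feature values into a Redis-backed feature store using pipelining. The key layout, names, and the choice of the Jedis client are illustrative assumptions, not DoorDash's actual implementation.

```kotlin
import redis.clients.jedis.Jedis

// Hypothetical feature row produced by the offline pipeline (e.g. a Snowflake extract).
data class FeatureRow(val entityId: String, val featureName: String, val value: String)

fun uploadFeatures(rows: List<FeatureRow>, redisHost: String = "feature-store.example.internal") {
    Jedis(redisHost, 6379).use { jedis ->
        val pipeline = jedis.pipelined()
        for (row in rows) {
            // One hash per entity; each feature is a field in that hash (assumed layout).
            pipeline.hset("features:${row.entityId}", row.featureName, row.value)
        }
        pipeline.sync() // flush the batched commands in a single round trip
    }
}
```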
Finally, the model prediction pillar, where we have a centralized online prediction service that’s based on Kotlin and C++. It supports tree base as well as [pothos] models. And these models can be deployed easily via a runtime configuration. The prediction service can fetch features from the feature store if they’re not provided in the prediction requests. For the prediction results, I’ll log into data lake for data scientist to analyze and train the models. The next section, we’ll cover interesting challenges with supporting high QPS prediction use cases. Now, I would like to handover to Arbaz to talk about an interesting scaling challenges of our online predictions.
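A minimal sketch of the request flow just described: use the features passed in the request, fall back to the feature store for anything missing, then score and asynchronously log the result. The interfaces and names here are hypothetical stand-ins, not the actual service code.

```kotlin
// Hypothetical interfaces standing in for the real components.
interface FeatureStore { fun fetch(entityId: String, names: Set<String>): Map<String, Double> }
interface Model { val requiredFeatures: Set<String>; fun score(features: Map<String, Double>): Double }
interface PredictionLogger { fun logAsync(modelId: String, features: Map<String, Double>, score: Double) }

class PredictionService(
    private val models: Map<String, Model>,
    private val featureStore: FeatureStore,
    private val logger: PredictionLogger
) {
    fun predict(modelId: String, entityId: String, passedFeatures: Map<String, Double>): Double {
        val model = models.getValue(modelId)
        // Only hit the feature store for features the caller did not provide.
        val missing = model.requiredFeatures - passedFeatures.keys
        val fetched = if (missing.isEmpty()) emptyMap() else featureStore.fetch(entityId, missing)
        val features = passedFeatures + fetched
        val score = model.score(features)
        logger.logAsync(modelId, features, score) // results land in the data lake for analysis/retraining
        return score
    }
}
```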

Arbaz Khan: Thanks, Hien, for a comprehensive overview of the machine learning platform at DoorDash. Hi everyone, my name is Arbaz and I am excited to share this journey of scaling online predictions. Just a quick word on why it's critical to consider online predictions when scaling an ML platform. In the graphic you saw before, there were two kinds of workflows, online and offline. Online predictions act as the bridge between practicing ML and applying it to real-world use cases. When you get close to a real-world application, that's when the stakes get high: a system failure in the online world could mean anything from slow degradation to a complete website takedown. That's why it's important to address the scaling needs and reliability of online predictions much more than the offline ones.
When people new to this domain think about scaling online predictions, they often ask: isn't scaling online prediction the same as scaling any other microservice, if you're considering a microservice-based architecture? The answer is yes and no. Yes, because the prediction service lives in a microservice framework, is implemented as a microservice, and does database operations. But it has unique challenges that require specialized treatment. Let me enumerate these challenges so that you get an understanding. First, [inaudible] prediction requests have larger payloads. This is because each of these requests has dozens of features that need to be passed in. It's also a common practice to batch multiple prediction requests into a single request, because at times you're not just making one prediction; you're making multiple predictions to render a single page, for instance. As a result, the payload size increases. It could be anywhere from tens of kilobytes to hundreds.
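To make the payload point concrete, here is a rough sketch of what a batched prediction request might look like and why it grows quickly; the shapes and numbers are illustrative assumptions, not the actual API.

```kotlin
// Hypothetical request shape: many predictions per call, dozens of features each.
data class PredictionInput(val entityId: String, val features: Map<String, Double>)
data class BatchPredictionRequest(val modelId: String, val inputs: List<PredictionInput>)

fun main() {
    // e.g. ranking one page of 50 stores, each with ~40 numeric features
    val inputs = (1..50).map { store ->
        PredictionInput("store_$store", (1..40).associate { "feature_$it" to it * 0.1 })
    }
    val request = BatchPredictionRequest("store_ranking_model", inputs)
    // 50 inputs x 40 features x (name + 8-byte value + framing) easily reaches tens of KB per request.
    println("inputs=${request.inputs.size}, features per input=${request.inputs.first().features.size}")
}
```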
Second, you need to run model computations, which are compute-intensive, but still within the same [inaudible] constraints as any other microservice. Third, one interesting way the prediction workflow is different is that apart from production traffic, there's also experiment traffic. Data scientists and machine learning engineers keep running shadow tests or A/B tests to validate their models, which adds a factor of traffic to the overall consideration. Finally, auditing. Not every service needs auditing just for compliance purposes, but here it's even more important, because data scientists use the audit logs, that is, the outputs of the prediction requests, to see if there are any issues with a deployed model in real time. They also use these logs to further train their models, which is a critical need for improving them.
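One common way to realize the shadow-traffic and audit-logging behavior described here is to score the candidate model off the critical path and log both outputs for comparison. This is a hedged sketch with hypothetical types, assuming Kotlin coroutines for the fire-and-forget call; it is not the actual DoorDash implementation.

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch

interface Scorer { fun score(features: Map<String, Double>): Double }
interface AuditLog { fun record(modelId: String, features: Map<String, Double>, score: Double) }

class ShadowingPredictor(
    private val productionModelId: String,
    private val production: Scorer,
    private val shadowModelId: String?,
    private val shadow: Scorer?,
    private val audit: AuditLog,
    private val scope: CoroutineScope = CoroutineScope(Dispatchers.Default)
) {
    fun predict(features: Map<String, Double>): Double {
        val score = production.score(features)
        audit.record(productionModelId, features, score)
        // Shadow scoring happens off the request path; failures here must never affect the caller.
        if (shadow != null && shadowModelId != null) {
            scope.launch {
                runCatching { audit.record(shadowModelId, features, shadow.score(features)) }
            }
        }
        return score
    }
}
```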
You might ask: yes, I understand there are these peculiar challenges you have to face when building anything related to predictions, but what do you mean by scaling online predictions in this context? I would say that scaling is best understood here from the users' perspective. Who are the users here? They are the data scientists and machine learning engineers, and the systems, that is, the upstream microservices that call your prediction service. The data scientists will keep pushing newer models and newer features into production. They will also keep experimenting rapidly, which, as I said before, adds to the traffic; it effectively multiplies the overall traffic by a factor. Those are their scaling demands. And then there are the microservices.
These are the upstream services that call you and want you to reliably support higher and higher QPS, either because the product is growing or because there is a situational uptick in demand due to some event. In our case at DoorDash, the pandemic last year caused a boom for the delivery industry, which increased our exposure and caused exactly that kind of situational uptick in demand.
When you approach these scaling needs, I believe you go through different phases of scaling, in increasing order of intensity. In the initial phase, as you roll out the system and traffic picks up, you are in a happy phase where you can just throw more instances, more workers, and more shards at it as more traffic comes in, and the system will happily oblige. Then comes phase two, where you see that horizontal scaling has its limits and that there are users consuming more resources than others. To be able to continue horizontal scaling, they have to be isolated out. After that, you resume scaling, and you keep scaling until you can't anymore, because either your costs have shot through the roof or there are inherent inefficiencies in your system, and so there is a need to optimize to survive.
Then finally, since not everything is in your control, once your service grows big enough, really big, things you can't control, such as the overall infrastructure, will start showing major negative effects. You will be left monitoring all of the infrastructure, and this might require a revamp of sorts, or forced throttling, to survive at that stage.
Now, my goal with this talk is to make you live these four phases through the experience we had at DoorDash. I want to share at what precise scales we hit these phases and what we did to transition out of them. First phase, the happy phase: horizontal scaling. Let me just chalk out the essential components involved here. The feature store, as Hien discussed, is deployed via AWS ElastiCache; to scale it horizontally, we can simply increase the number of shards on the cluster. The prediction microservice, as Hien mentioned before, is a Kubernetes-based microservice; it's elastically scalable, and we can add more service replicas to it. The prediction store is a Snowflake table, which we partition by multiple relevant fields such as date and model ID. The link between the service and the prediction store is mediated via stream processing to ensure throughput, and we have a stream processor that can elastically scale its number of workers as the traffic goes through that bridge.
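The "bridge" between the prediction service and the prediction store can be pictured as a buffer that batches results before they reach Snowflake. Below is a simplified, hypothetical sketch of that idea using an in-memory queue and a flushing worker; the real system uses a stream processor, so treat this purely as an illustration of the batching behavior.

```kotlin
import java.util.concurrent.LinkedBlockingQueue
import kotlin.concurrent.thread

data class PredictionRecord(val modelId: String, val entityId: String, val score: Double, val epochMs: Long)

class PredictionSink(private val flush: (List<PredictionRecord>) -> Unit, private val batchSize: Int = 1000) {
    private val queue = LinkedBlockingQueue<PredictionRecord>()

    init {
        // A worker drains the queue and writes batches downstream (e.g. into a prediction store table).
        thread(isDaemon = true) {
            val batch = ArrayList<PredictionRecord>(batchSize)
            while (true) {
                batch.add(queue.take())               // block for at least one record
                queue.drainTo(batch, batchSize - 1)   // then grab whatever else is ready
                flush(batch.toList())
                batch.clear()
            }
        }
    }

    fun publish(record: PredictionRecord) = queue.offer(record)
}
```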
So, we were able to use this whole setup of horizontal scaling to handle more and more traffic. As you can see here, this graph shows how we went from 1,000 predictions per second in March to upwards of 15,000 predictions per second by June of last year, 15x growth within a span of three months, all on the basis of just adding workers and shards. And then we observed that at least some models consume more resources than the others. Pulling up a table of p95 latencies here, Dasher dispatch, one of the highest-latency models, is 50 to 60x slower than the lowest-latency models. So you can see that more resources are consumed by this high-latency model, and the low-latency models start getting cannibalized.
They start seeing degradation in their SLAs. So even though you keep adding workers in the horizontal scaling cycle, the degradation does not get any better. That's when we transitioned to phase two. We realized that user isolation was the need of the hour and started to think about what isolation strategies we could employ. One, which is what we were using, is to have all models live within one service deployment. It works when models are homogeneous in size and the usage patterns for these models are very similar, which is often not the case. The other extreme is to have one service deployment per model. Here the primary concern is cost, because to operate in a cost-efficient manner you need some sort of autoscaling in place; any time you have a service running and hosting a model that is not getting used, you are wasting CPU hours.
Autoscaling can help reduce that, but it needs to work perfectly for you to get a cost-efficient system. The middle ground is a hybrid approach, where some of the models live within a single deployment and you consciously decide when to isolate. So there are multiple service deployments, and each service deployment can host multiple models. You might ask: is this always the way to go? It does seem like the best of both worlds, controlling cost while also mitigating bottlenecks. But the drawback is that we need to be able to tell when to isolate, and once we have isolated, we need the ability to decide which models live where.
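A hybrid isolation setup usually boils down to a routing table that maps specific models to dedicated deployment groups, with everything else landing in a shared default. Here is a minimal sketch; the model IDs and deployment names are made up, and the real routing may well be driven by runtime configuration rather than code.

```kotlin
// Hypothetical mapping of heavy or sensitive models to their own deployments;
// anything not listed shares the default deployment.
class ModelRouter(
    private val dedicated: Map<String, String> = mapOf(
        "dasher_dispatch_eta" to "prediction-service-dispatch",   // high-latency model, isolated
        "search_ranking"      to "prediction-service-search"      // high-QPS model, isolated
    ),
    private val defaultDeployment: String = "prediction-service-shared"
) {
    fun deploymentFor(modelId: String): String = dedicated[modelId] ?: defaultDeployment
}

fun main() {
    val router = ModelRouter()
    println(router.deploymentFor("dasher_dispatch_eta")) // prediction-service-dispatch
    println(router.deploymentFor("menu_item_recs"))      // prediction-service-shared
}
```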
Assuming you are able to get past these drawbacks and have such a system in place, you'll probably ask how far this attempt took us in our scaling endeavors. Was it a happy scaling life ever after? No, it was not. We were able to grow to a big enough number, though: we hit six figures in predictions per second. But we couldn't go as much higher as we had wanted to, because we couldn't avoid a shard becoming the limiting factor. One benefit we did get was that not every use case in our system was impacted by one shard being overloaded. Regardless, the limiting shard kept growing much faster than the others, and we started hitting AWS hardware limits. We couldn't scale horizontally any further for this particular shard, but we still needed the predictions to keep flowing.
That's when we entered phase three. Until this point, scaling was merely doing more work, but to hit larger scales in a sustainable manner, the mission was to do more with less, that is, doing more work with fewer resources. The most common example I find here is the case of imperfect load balancing. Imagine you're adding workers, but because traffic is not evenly balanced, you're not shedding load off your existing workers, and the [inaudible] workers continue acting as bottlenecks no matter how many nodes you add. Optimizing load balancing in that case is how you scale to survive.
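As a concrete illustration of the load-balancing point, the sketch below uses a "power of two choices" pick of the less-loaded replica, which tends to keep slow workers from becoming bottlenecks in a way that naive round-robin does not. The types and counters here are hypothetical.

```kotlin
import java.util.concurrent.atomic.AtomicInteger
import kotlin.random.Random

class Replica(val name: String) {
    val inFlight = AtomicInteger(0) // requests currently being processed by this replica
}

// Power-of-two-choices: sample two replicas at random and send to the less loaded one.
fun pickReplica(replicas: List<Replica>): Replica {
    val a = replicas[Random.nextInt(replicas.size)]
    val b = replicas[Random.nextInt(replicas.size)]
    return if (a.inFlight.get() <= b.inFlight.get()) a else b
}

fun main() {
    val replicas = (1..5).map { Replica("replica-$it") }
    repeat(10) {
        val chosen = pickReplica(replicas)
        chosen.inFlight.incrementAndGet() // would be decremented when the request completes
        println("routed to ${chosen.name}")
    }
}
```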
So how did we holistically approach this? We applied multiple optimizations across layers, the microservice layer and the feature store layer, and the approach was a loop of observations and iterations. For the microservice, an observation could be doing load testing and understanding the bottlenecks: how far the QPS can go before things start breaking, what the latency profiles look like, and where we are spending the most time. gRPC tracing was one such tool that turned out to be helpful to us; we published about that on our blog as well. We then used these observations to make iterations on the platform. There were low-hanging fruits that we could readily adopt, such as parameter tuning, like which parameters could be [inaudible]?
These could be the sizes of our payloads: if we are batching much more than desired, we can cut the batch sizes, or if we're not making good use of the resources, we can increase the batch sizes. We also tuned things like the timeouts on connections and what we expect from them, and there were bandwidth-related optimizations that we could either enable or disable. Then there were a few more involved optimizations, such as payload compression and compile-time optimizations, and all of these together helped us improve the microservice layer. Then we moved on to the feature store layer, to see whether the feature store cluster we had at that time was the best we could have.
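For the tuning knobs mentioned here (payload and batch sizes, timeouts, compression), a gRPC client exposes several of them directly. The sketch below shows the general shape using grpc-java's ManagedChannelBuilder from Kotlin; the host, values, and the choice of which knobs matter are assumptions to illustrate the idea, not DoorDash's actual settings.

```kotlin
import io.grpc.ManagedChannel
import io.grpc.ManagedChannelBuilder
import java.util.concurrent.TimeUnit

fun buildPredictionChannel(): ManagedChannel =
    ManagedChannelBuilder.forAddress("prediction-service.internal", 50051)
        .usePlaintext()                                   // assume TLS is terminated elsewhere
        .maxInboundMessageSize(4 * 1024 * 1024)           // cap response payloads at 4 MB
        .keepAliveTime(30, TimeUnit.SECONDS)              // keep connections warm under bursty traffic
        .keepAliveTimeout(5, TimeUnit.SECONDS)
        .build()

// Per-call options would typically be set on the generated stub, e.g.:
//   stub.withDeadlineAfter(200, TimeUnit.MILLISECONDS)   // fail fast instead of queueing
//       .withCompression("gzip")                         // trade CPU for smaller payloads
```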
We did some benchmarking against a disk-based store to see if it could help us cut our costs or get more performance out. But this benchmarking helped us understand that our Redis cluster approach was the fastest and the cheapest one. So we had no option but to optimize on top of the existing implementation. We took the approach of a complete schema redesign to see how we could make use of [inaudible] in a better, more performant way. The results were in line with that: we reduced our cluster costs by 3x and also increased performance through some of these optimizations. Again, this is also published in one of our blog posts.
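One schema trick along these lines is to store each entity's features in a single Redis hash with compact, hashed field names instead of many flat string keys, which cuts both memory and round trips. The sketch below is a simplified illustration assuming a recent Jedis client; the hashing function, key format, and encoding are assumptions, not the published design.

```kotlin
import redis.clients.jedis.Jedis

// Illustrative field-name compression: replace long feature names with a short hash.
// (A real implementation would use a stable hash such as xxHash and a binary value encoding.)
fun fieldId(featureName: String): String = Integer.toHexString(featureName.hashCode())

fun writeEntityFeatures(jedis: Jedis, entityId: String, features: Map<String, Double>) {
    val hash = features.entries.associate { (name, value) -> fieldId(name) to value.toString() }
    jedis.hset("fs:$entityId", hash) // one hash per entity instead of one key per feature
}

fun readEntityFeatures(jedis: Jedis, entityId: String, names: List<String>): List<String?> =
    jedis.hmget("fs:$entityId", *names.map(::fieldId).toTypedArray())
```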
With all these efforts, we breached 1 million predictions per second, the magic number, and the system was running stably all this time. This was a major accomplishment for us, because we could use this setup for quite some time; we were able to hold up traffic even up to 2 million predictions per second. It was obviously not the end of scaling, though; we were at the cusp of territory where anything could happen.
This is where we transition into the final phase that we're going to discuss. It's the phase at which we started hitting scales that could impact the entire production infrastructure. Think of implications such as bringing all of the services down at once because your service deployment went haywire. These are the kinds of implications that can happen when a service grows that big. To enumerate further: our logging and metrics infrastructure stopped supporting us because we were hitting limits that were not acceptable. The third-party vendor we used for event processing blocked us for sending too many events, because we breached the rate limits there as well. And our in-house service discovery module, Consul, was threatened with a total outage.
That happened when one of our service deployments went into a continuous restart loop, where the pods were constantly sending IP address update requests to Consul, which was also serving the other services. Because Consul was overwhelmed, those other services were getting impacted. So this is how we tackled these issues; we just picked them off one by one. For logging, we got it down to a bare minimum: log only what we need, at the bare minimum level. For metrics, we moved from a pull-based monitoring framework to a push-based monitoring framework to cut the cost, that is, to cut the number of events we were sending. We migrated away from the third-party streaming vendor to an in-house Kafka solution to have more control over the deployments, the cost, and the rate limits. And we reduced our total number of pods, by making them bigger, to ease the load on service discovery.
So instead of having a much larger number of pods, you have a smaller number of pods that are bigger in size. With this conscious and frugal use of resources, we were able to hit much larger scales, north of six million predictions per second. This isn't everything we want, though; there is more scaling that we anticipate to come, and we're keeping an eye on some of the remaining bottlenecks as well. We'll keep publishing and talking about our learnings as we approach these higher and higher scales.
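The move from pull-based to push-based metrics, and cutting the volume of events, can be approximated by aggregating counters locally and flushing one summary per interval instead of emitting every observation. The sketch below is a generic illustration in plain Kotlin; it is not DoorDash's monitoring stack.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.LongAdder

// Aggregate counts in memory and push one summary per interval, rather than one event per prediction.
class PushMetrics(private val push: (Map<String, Long>) -> Unit, flushSeconds: Long = 10) {
    private val counters = ConcurrentHashMap<String, LongAdder>()

    init {
        Executors.newSingleThreadScheduledExecutor { r -> Thread(r).apply { isDaemon = true } }
            .scheduleAtFixedRate({
                val snapshot = counters.mapValues { it.value.sumThenReset() }.filterValues { it > 0 }
                if (snapshot.isNotEmpty()) push(snapshot)
            }, flushSeconds, flushSeconds, TimeUnit.SECONDS)
    }

    fun increment(name: String, delta: Long = 1) =
        counters.computeIfAbsent(name) { LongAdder() }.add(delta)
}
```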
If you were to ask me, from all this information I've shared so far, what the key takeaways are: first, there should always be an understanding of when to isolate. We took the approach of isolating use cases wherever possible. Secondly, every time you face a new scaling challenge, one easy way out is to scale out, that is, just keep adding workers, keep adding nodes. It usually helps, especially in situations like a Friday evening with high traffic and a QPS spike: let's just scale out. But if you're constantly seeing more traffic than you originally planned for, it's time to introspect and see how you can make the system better.
Because if you don't, it will start to show up somewhere. In the worst case, it will show up at the level of the company's infrastructure, and then it's not just about your service but about all the services out there. That was my part. Thanks for patiently hearing me out. I will hand it over to Hien to carry you through the last leg of this presentation.

Hien Luu: Wow. Thank you, Arbaz, for walking us through those fascinating challenges and how we tackled them. Here we would like to share a few lessons learned. First, the happy path: as Arbaz mentioned, the prediction service is part of the user-facing production traffic flow, so ideally it shouldn't cause an outage if something goes wrong, for example while fetching features. Therefore, it is very important to have proper default values in place. In terms of customer obsession, in addition to working very closely with the ML teams, we also strive to understand the business problems they're trying to solve, as well as the business impact. Having this understanding helps us prioritize our projects. The last one is big vision, building incrementally. Building out our platform incrementally gives us the opportunity to receive feedback and to make the necessary adjustments along the way, and this has been really helpful.
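The "proper default values" lesson can be made concrete with a small guard around feature fetching: if the feature store call fails or a feature is missing, fall back to a per-feature default so the prediction path never throws. The names and defaults below are hypothetical.

```kotlin
interface FeatureFetcher { fun fetch(entityId: String, names: Set<String>): Map<String, Double> }

// Per-feature defaults, typically chosen by the model owner (e.g. a training-set mean or a neutral value).
fun fetchWithDefaults(
    fetcher: FeatureFetcher,
    entityId: String,
    defaults: Map<String, Double>
): Map<String, Double> {
    val fetched = try {
        fetcher.fetch(entityId, defaults.keys)
    } catch (e: Exception) {
        emptyMap() // feature store outage: degrade gracefully instead of failing the request
    }
    return defaults.mapValues { (name, default) -> fetched[name] ?: default }
}
```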
In terms of future work, there's a lot to do, but here are a few things we are considering. For microservice optimization, as prediction volume increases, and we expect that to happen in the near future, we are looking into potentially caching the feature values as well as the prediction results. I'm sure there will be many interesting technical challenges in these two areas. Next is generalized model serving. So far, most of the ML use cases are based on structured data. Moving forward, though, we'll see more use cases in NLP and image recognition, so we need to extend our platform to support those model types. And lastly, the unified prediction client: as ML platform adoption increases, we want to provide an easier and more optimized way for our customers to make prediction requests.
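One shape the prediction-result caching mentioned above could take is a small TTL cache keyed by model and input, so repeated identical requests skip both the feature fetch and the model call. This is a speculative, simplified sketch, not a committed design.

```kotlin
import java.util.concurrent.ConcurrentHashMap

class PredictionCache(private val ttlMs: Long = 60_000) {
    private data class Entry(val score: Double, val expiresAt: Long)
    private val cache = ConcurrentHashMap<String, Entry>()

    fun getOrCompute(modelId: String, featureKey: String, compute: () -> Double): Double {
        val key = "$modelId|$featureKey"
        val now = System.currentTimeMillis()
        cache[key]?.let { if (it.expiresAt > now) return it.score } // fresh hit: skip the model call
        val score = compute()
        cache[key] = Entry(score, now + ttlMs)
        return score
    }
}
```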
So, thank you for attending this session. Here's a link to the DoorDash engineering blog, where you can find many interesting posts about data science and machine learning.

Hien Luu

Hien Luu is a Sr. Engineering Manager at DoorDash, leading the Machine Learning Platform team. He is particularly passionate about the intersection between Big Data and Artificial Intelligence. He is ...

Arbaz Khan

Arbaz is a Machine Learning Platform Engineer at DoorDash where he focuses on challenges around usability and scalability of online model serving. He has been directly involved in growing the scale of...