Disney+ has rapidly scaled to provide a personalized and seamless experience to tens of millions of customers. This experience is powered by a robust data platform that ingests, processes, and surfaces billions of events per hour using Delta Lake, Databricks, and AWS technologies. The data produced by the platform is used by a multitude of services, including a recommendation engine for personalized experiences, watch-experience optimization including GroupWatch, and fraud and abuse prevention.
In this session, you will learn how Disney+ built these capabilities, and the architecture, technologies, design principles, and technical details that make them possible.
Martin Zapletal: Hello everyone. Thank you for coming to our presentation. My name is Martin Zapletal and I’m a Director of Engineering at Disney Streaming.
Rekha Bachwani: Hi folks. My name is Rekha. I’m a Principal Engineer at Disney Streaming. Today, Martin and I are going to talk about how we optimize the Disney+ customer experience from a data perspective.
So as you know, Disney+ started as a streaming service only a couple of years ago. However, it has grown at a very rapid pace to over a hundred million subscribers. In addition, under Disney Streaming there are a number of diverse services like ESPN+, Disney+, and others, and the content spans live TV, sports, pay-per-view events, and video on demand. In order to host this heterogeneous content, we have invested in building a self-service platform, comprising libraries, APIs, and tooling, instead of just building pipelines and wiring services together in a loosely coupled fashion. This platform provides a low-latency solution and ingests petabytes of data, which further powers the downstream analytics and learning-based solutions that we’re going to talk about later in the presentation.
So let me start with a couple of use cases that we have onboarded using this approach. The first one, which I think you are all familiar with, is the personalized homepage backed by a recommendation engine. The second, which may not be as visible and mostly lives in the backend, is what happens when a user hits play: how do those requests get routed?
That leads us to how we solve the traffic routing problem. We do that by optimizing over service-level metrics, like utilization and latency; user-level characteristics, like the device users are coming from and their bandwidth constraints; and CDN routing, that is, which content is hosted where.
Finally, the third use case, which is very different from these two, is fraud detection and prevention. And we’d like to do that without affecting good users. Now, these three use cases may all sound like machine learning, and they can indeed all be powered by data and machine learning. However, they have very different characteristics. So as we walk through the rest of our presentation, we’ll highlight some of the decisions that were influenced by being able to support these heterogeneous use cases.
So in terms of supporting these use cases, what are the main goals that we set for ourselves when we started building the platform? The first goal is that we want to give you a personalized experience, and we want to scale it to all our users, irrespective of the part of the planet they are coming from, and across our entire diverse content catalog.
The second is, as we get more users spread all over the planet and our content becomes more and more diversified, we want to be able to support the multitude of devices that are coming up, and we want to be able to do this seamlessly as we span networks and geographies. Finally, our ultimate goal, as I have said already, is to provide you with a seamless watch experience across all our platforms, irrespective of where you’re coming from, what device you’re streaming from, and what content you’re streaming.
For the rest of the presentation, we’ll focus on two use cases. The first one is the traffic routing that I already described. The second is the personalization aspect. Keeping these two use cases and the main goals I described in mind, let’s go over what our approach is.
As I said, the goal was to build a self-service platform. And the bet that we took is that the platform should be able to ingest, transform, validate, and persist data at scale. The way it does that is by capturing all the relevant interactions that we have, whether it is users interacting with our service or different microservices in the backend interacting with each other. We want to capture all of those in a meaningful fashion, and in a fashion that is useful for the downstream analytics.
Finally, we wanted a provision to continuously iterate on and improve the platform’s offerings as well. And again, that’s where the data comes into the picture. So the second key goal, or the second bet that we took for this approach, is embedding data, learning, and intelligence into the platform’s DNA. What that means is we log most of the interactions that we have, and we leverage the past behavior and the traffic patterns that emerge from this data to continually improve the experience and the efficiency of our infrastructure. And we do that at scale and across all the devices, geographies, and network constraints that we have to adhere to.
So with that, I’ll hand it over to Martin and he can give you an overview of the architecture and dive deeper into our platform.
Martin Zapletal: Thank you, Rekha. So this is, at a very high level, the architecture and how we think about our data and machine learning ecosystem. On the left-hand side, we have our users and devices, being data producers and sending data through the edge ecosystem, which is responsible for the connection with the devices and transporting the data to the streaming data platform. Our services are another important producer of data. They expose their datasets in the streaming data platform ecosystem, where the data can be processed or subscribed to. That data is also often ingested into our data lake for further batch processing and analytics.
The analytics and machine learning platform here can leverage both the batch and stream processing to build machine learning use cases. And lastly, at the very bottom, we have our experimentation framework that allows us to build experiments and experiment with our features.
Throughout this presentation, we will focus on this section of that ecosystem. And I want to highlight two main concepts that we will talk about throughout the rest of the presentation. First of all, the platform teams build libraries, services, tools, and automation: essentially a self-service, integrated, and decentralized ecosystem for unified work with the data. It enables the organization, and often other teams, to leverage what they need to solve their use cases and build their data-first analytics and machine learning solutions.
And the second thing is that, because of how we build the ecosystem, it allows us to build data solutions at various layers, potentially using different technologies and often having different SLOs. So teams can leverage batch processing and offline analytics. They can leverage nearline and online use cases as well, using streaming, or even use these data capabilities in their services to handle user requests and operational use cases.
So the data ecosystem allows us to build not just analytical, but also operational use cases that impact customer experience in real time. And often it’s both. So the line between the analytical and operational, but also between data and services is sometimes a bit blurred and that’s intentional.
So let’s talk about how we achieve that in our ecosystem. This is an example pipeline. It’s not the pipeline that exists in our ecosystem, but it’s close enough to be able to demonstrate the concepts. And secondly, the overall ecosystem that supports Disney+ and similar solutions is much larger than this, with many more producers, many more consumers and users of the data.
So just at a very high level: on the left-hand side here, we have the devices producing the data, and at the top left we have services, again being producers of data. At one point, we join the two streams together, do some enrichment, routing, and filtering, and then send the data downstream for users to leverage, into these two Amazon Kinesis Data Streams instances: one of them is used for data that fails validation, the other for the data that passes validation. And then we use Databricks and Spark to ingest the data into our data lake, here represented by Amazon S3.
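The validate-and-route step described above can be sketched in plain Python. This is a minimal, hypothetical illustration, not Disney's actual implementation: the real pipeline runs on Kinesis and Spark, and the event fields used here are assumptions.

```python
# Minimal sketch of the validate-and-route step: events that pass
# schema validation go to the "valid" stream, the rest to the
# "invalid" stream for later inspection. Field names are hypothetical.

REQUIRED_FIELDS = {"event_id", "user_id", "event_type", "timestamp"}

def validate(event: dict) -> bool:
    """An event is valid if all required fields are present and non-empty."""
    return all(event.get(f) not in (None, "") for f in REQUIRED_FIELDS)

def route(events):
    """Partition a batch of events into (valid, invalid) lists,
    mirroring the two downstream Kinesis streams."""
    valid, invalid = [], []
    for e in events:
        (valid if validate(e) else invalid).append(e)
    return valid, invalid
```

Keeping failed events in a separate stream, rather than dropping them, is what lets downstream consumers inspect and replay bad data later.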
And at the very bottom, we have two other pipelines that process the data in real time. One uses Databricks, the other one Amazon Kinesis Data Analytics for Apache Flink, and they apply real-time aggregations and analytics, and then make the results of that process available to downstream consumers to use or visualize.
Now, one of the first things that makes data usable across the organization is automated data management. That includes schema management, evolution, quality, governance, access control, discoverability, security, privacy, lineage, and all of the other attributes associated with managing the data. And we need to apply the same management throughout the ecosystem, from our producers all the way to our data lake. This is especially important in our real-time and operational use cases, so that each use case does not have to deal with data issues in real time, which would be really challenging.
So we built our own solution, called Schema Registry, which supports all of those attributes that I just talked about, and also integrates with the other tools in our data ecosystem. It offers a centralized view of the data and quality definitions, which is then shared with all of the users of the data: the producers, consumers, and others.
The way we designed it was to separate the declarative definition of the data expectations from the execution, that is, the enforcement of those definitions, and from the subsequent actions, what we do about any potential issues. So we define quality checks, and the defined quality checks can be executed at different places in the pipeline and at different stages of the development life cycle, to meet the required SLOs for each of the use cases.
So for example, we generate functional Scala code to capture errors at compile time or during development. We expose APIs to validate data again when producing data and during automated QA tests. And we also have integration with Databricks and Spark for checks during ingestion [inaudible] or during the processing of the data later in the data lake as well. So we maintain the correct format, standards on quality, and all of the other guardrails throughout the ecosystem, which allows us to enable other teams to start building data solutions leveraging these tools and our machine learning and analytics tools. And we allow them to do this easily, to iterate quickly, and to focus on their particular use case.
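The idea of separating declarative quality definitions from their enforcement can be sketched as follows. This is a hedged, illustrative sketch, not the actual Schema Registry: the check names and event fields are assumptions, and the real system generates Scala code rather than interpreting checks at runtime.

```python
# Sketch: quality checks are declared once, centrally (as in a
# schema registry), and the same definitions can then be executed
# at different stages: at produce time, during QA, or at ingestion.

from dataclasses import dataclass
from typing import Callable

@dataclass
class QualityCheck:
    name: str
    predicate: Callable[[dict], bool]   # the declarative expectation

# Central, shared definitions (field names are hypothetical).
CHECKS = [
    QualityCheck("has_event_id", lambda e: bool(e.get("event_id"))),
    QualityCheck("positive_ts", lambda e: isinstance(e.get("timestamp"), int)
                                          and e["timestamp"] > 0),
]

def run_checks(event: dict, stage: str) -> list:
    """Execute the shared checks at a given pipeline stage and return
    the names of the failed checks. The action taken on failure can
    differ per stage: raise at produce time, quarantine at ingestion,
    report during automated QA."""
    return [c.name for c in CHECKS if not c.predicate(event)]
```

Because the definitions live in one place, every producer and consumer enforces the same expectations without re-implementing them.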
Another example of how we think about the platform services is providing self-service patterns. Teams are always allowed to build their own solutions, ideally using the SDKs, tools, and services that I just talked about, but there are also some patterns that are repeated, and teams can use them and deploy them using just a configuration. In this example, these two blue boxes are essentially the same thing: the creation of a view or a snapshot from a stream, which is a common pattern. In this case, we ingest Amazon Kinesis Data Streams into Amazon S3 using Databricks. And then we have another Databricks job that processes the data and creates well-formed Delta tables that are then used as the basis for our [inaudible] data lake downstream.
And the reason, again, we can do this is because of the Schema Registry we mentioned previously. We know the shape of the data, we know the expectations, such as the quality of the data that we expect, and the checks that we need to apply. And we also have the generated code that allows us to efficiently transform between formats, such as JSON or Protobuf, and the Delta format in our data lake. Secondly, since Delta supports both streaming sinks and streaming sources, we can actually continue streaming downstream if we need to, depending on the use case and its SLOs. And when talking about the platform, and especially the operational use cases that support Disney+, we need to meet the same quality bar as our other backend services do. So we need to build tools that solve problems around observability, reliability, resilience, elasticity, and cost efficiency, to be able to support these operational use cases at scale.
This is just an example of throughput in the same stream in two of our geographical regions. You can see that at one point, one of the streams of traffic drops to essentially zero and the other stream picks up the traffic, and then eventually things stabilize, go back to normal, and we continue processing the data. This happened because we lost one of our regions: there was a regional failover. Fortunately, in this case, this was part of our chaos testing initiatives, so we actually wanted this to happen. And the reason we wanted this to happen was to gain confidence in our solution, and confidence that we can maintain SLOs such as reliable delivery, ordering of events, latency, or data availability and timeliness, because while it differs between our use cases, many of them require some of these guarantees.
So we built a number of tools that make this possible and, again, make it easy for teams to build solutions leveraging them: for example, a deployment management and automation tool called FENOs; an A/B or blue-green deployment for streaming solutions that minimizes the downtime and latency impact of deployments of new versions; or an auto-scaling tool that uses the Spark Structured Streaming listener and the metrics available there, and then uses the Databricks API to scale the cluster up and down as needed. Those are just some of the examples, but there are definitely many more. And lastly, let’s talk briefly about the tech stack. We use Amazon Kinesis Data Streams and Amazon MSK as the backbone of our streaming solutions, and Amazon S3 for a lot of our batch processing. And then on top of them, a number of technologies like Amazon Kinesis Data Analytics for Apache Flink, AWS Lambda, and the AWS client libraries.
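The scaling decision such an auto-scaler might make can be sketched as a simple function of streaming metrics: how long each micro-batch takes relative to the trigger interval. This is a hypothetical policy for illustration; the thresholds and the linear scaling rule are assumptions, not the actual tool's logic.

```python
# Sketch of an auto-scaling decision based on Structured Streaming
# metrics (batch duration vs. trigger interval). The resulting target
# size would then be applied via the Databricks clusters resize REST
# endpoint (POST /api/2.0/clusters/resize).

def target_workers(current: int, batch_ms: float, trigger_ms: float,
                   min_w: int = 2, max_w: int = 50) -> int:
    """Scale up when batches take about as long as (or longer than)
    the trigger interval, meaning the stream is falling behind; scale
    down gently when there is ample headroom; clamp to [min_w, max_w]."""
    utilization = batch_ms / trigger_ms
    if utilization > 0.9:            # falling behind, or close to it
        proposed = int(current * utilization) + 1
    elif utilization < 0.4:          # large headroom: shrink one step
        proposed = current - 1
    else:                            # comfortable zone: hold steady
        proposed = current
    return max(min_w, min(max_w, proposed))
```

Scaling down one worker at a time while scaling up proportionally is a common asymmetry: falling behind is costly, while brief over-provisioning is merely inefficient.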
But we also rely heavily on Databricks for both streaming and batch processing. And we use other technologies like Airflow, JupyterHub, and many others. Importantly, on top of them we build this ecosystem of tools for the streaming data platform, the machine learning platform, and the experimentation framework, which provides an abstraction layer that allows our services teams to build their solutions, potentially leveraging data and analytics; our data teams to build their data processing solutions and ETLs; and everyone to leverage the analytics and machine learning solutions, again at different layers as needed. So with that, I’ll hand it back over to Rekha to walk through the details of the machine learning platform and how the use cases map to the platform.
Rekha Bachwani: Thanks, Martin, for the detailed description of the platform and the awesome tooling and services that it provides. I’ll just briefly touch on the ML platform side of things. The ML platform, as Martin mentioned, is more of an abstraction on the underlying technologies like Databricks, JupyterHub, and EMR, and what we are looking to provide as part of it is simplified ML development and deployment. So we rely heavily on Databricks notebooks, and on scheduling jobs via Kinostart, a tool we built in-house. In addition, we leverage custom Docker images, which have the ML packages built in at specific versions, and provide them as Docker images so that it’s easier for upstream analytics and ML. Also, using both Databricks and EMR, we are building a sandbox environment where folks can test their ML pipelines, with all the production data, before they are ready to deploy in production, so that we are confident when we are deploying our models, whether on the streaming platform or in our rec engine service.
So next, I’ll dive deeper into the actual analytics and ML use cases. The use cases stay the same, but I’ll go into deeper detail on what kind of models we put in place, what a typical pipeline looks like, and so on. So let’s go back to the use cases. I know you’ve heard these use cases throughout the presentation, but I’ll dive a little bit deeper into them. The first one I mentioned was traffic routing, but before we get to traffic routing, we had to understand and be good at predicting our seasonality and traffic trends. To achieve that, we started off with some simple statistical models and time series analysis, eventually getting better at them so we could accurately forecast our demand. And when I say demand, demand depends on whether we have live events going on, whether it’s prime time for folks to watch video on demand, and things like that.
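As an illustration of the kind of simple statistical baseline such work often starts from (the talk does not specify the exact models), here is a seasonal-naive forecast: predict each future hour as the average of the same hour in past daily cycles. The hourly granularity and the averaging rule are assumptions for the sketch.

```python
# Seasonal-naive demand forecast: each future slot is forecast as
# the mean of all past observations at the same position in the
# seasonal cycle (e.g. the same hour of day for period=24).

def seasonal_naive_forecast(history, period=24, horizon=24):
    """history: list of demand values at one-slot granularity.
    Returns `horizon` forecasts, one per future slot."""
    forecasts = []
    for h in range(horizon):
        slot = (len(history) + h) % period
        same_slot = [v for i, v in enumerate(history) if i % period == slot]
        forecasts.append(sum(same_slot) / len(same_slot))
    return forecasts
```

Baselines like this are easy to reason about and give a floor that richer time series models (trend, live-event effects) must beat.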
So those factors are taken into account, and we are looking to improve our demand forecasting. In addition to using time series analysis and demand forecasting, we are also looking into leveraging user preferences. Network affinity or CDN affinity is one aspect of it, but we’re also trying to incorporate the type of device our user is streaming from. For instance, if a user is streaming from an Android device in some rural or remote location, serving them high-resolution content, or serving it at the highest speed, may actually be detrimental to their experience, because it’s going to drain their battery faster, and things like that. So we try to take into account the devices from which the requests are originating and the geographical locations the requests are coming from, so that we can best optimize for the user experience. In addition to optimizing the user experience, we’re also looking at doing it efficiently.
So one of the side goals for us is to have high operational efficiency. What I mean by operational efficiency here is that we look into optimally allocating our resources. For instance, say someone is streaming a live event on a 4K television, another person is streaming the same event from a mobile device, and a third person is watching video on demand. If we leverage analytics and machine learning appropriately, we can optimize for all of these and still be efficient on the infrastructure side. So those are all the things we take into account: we look at the service characteristics, like the content metadata, the network, and the bandwidth; the user characteristics; and our predicted demand and seasonality patterns. We combine all of them to provide you with the best experience. Now, I’ve spoken a lot about what kind of ML we can use and how we put those things in place, but these are each different pipelines.
So let me talk a little bit about how we bring those things together. But before I do that, here’s what a typical pipeline looks like. There is event data, which is the streaming data coming in. Now, there are two paths that this data can follow. Martin described those in greater detail, but to put it in an abstract fashion: it gets transformed, a schema is overlaid, and it ends up in our data lake. The other path is that, if the model requires real-time features or real-time data, some of the real-time data also goes in directly. Next is the feature engineering step, which combines both the batch data and the real-time data, and computes a set of features that are necessary for our models in production. The blue box of machine learning that you see represents an abstract runtime ML platform, where there might be heuristics and metrics that are computed, time series analysis for demand forecasting or CDN routing, and some predictors and classifiers for our fraud and abuse prevention.
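The feature engineering step that combines batch and real-time data can be sketched as a per-user merge. This is a hypothetical illustration: the feature names and the "freshest value wins" rule are assumptions, not the actual feature store's behavior.

```python
# Sketch of the feature engineering step: offline features computed
# in the data lake are merged, per user, with fresh features from the
# stream before scoring. Feature names are hypothetical.

def build_feature_vector(user_id: str,
                         batch_features: dict,
                         realtime_features: dict) -> dict:
    """Merge offline and real-time features for one user. Real-time
    values overwrite batch values on conflict because they are fresher."""
    features = {"user_id": user_id}
    features.update(batch_features.get(user_id, {}))
    features.update(realtime_features.get(user_id, {}))
    return features
```

The resulting vector is what a runtime model (a router, a fraud classifier, or the recommendation engine) would score against an incoming request.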
And the one thing that I haven’t mentioned here is the recommendation engine. So these features are available to any of the models in our runtime, which can then score the incoming requests and output the decisions to the respective services. Now, each service is empowered to use the output of these models in the way it sees fit, but what we do is log all the decisions that the services made into our data lake. And some of those are brought back and show up again as streaming data. So it’s a continuous loop, and the reason we are able to do it is because of some of the platform architecture decisions, which Martin described in detail, that we have made. So I hope this provides you with a good overview of our ML ecosystem and how it benefits from the underlying platform architecture that we have chosen.
So with this, I’ll just wrap up. We talked about how we optimize for a personalized experience at scale. The key bet that we have made in order to achieve that personalized experience is to make data integral to our ecosystem. In addition to data, we are looking to have a continuous feedback loop to keep this iteration and continuous improvement process going. And none of this would be possible without the self-service platform, comprising libraries, SDKs, services, and automation.
And our focus on solution enablement and being self-service, rather than just building pipelines and putting them together into a runtime. And finally, I think the rapid scale at which we have grown from a new streaming service to a huge member base wouldn’t be possible if we didn’t use a lot of open source technologies and strong vendor partners like Databricks, who have provided us with managed solutions to aid this rapid development. So with that, thank you for your time. Thanks for taking the time to listen to us. We are available to answer any questions you may have. And, as always, we are hiring, so please visit disneytech.com to see if any roles speak to you, or if you have any questions about them, feel free to reach out to Martin and me. And finally, your feedback is important to us, so please don’t forget to rate and review our sessions and share any feedback that you have for us.
Rekha is a Principal Engineer at Disney+ with expertise in machine learning, security and distributed systems. She leads the ML Engineering team that drives ML strategy for the services and engineerin...
Martin is a Director of Engineering in the Disney+ team. Martin is responsible for the company’s data platform strategy and real-time decisioning capabilities, leveraging his technical leadership an...