10 Must Haves to Deploy Machine Learning
- Forrester Research On The State Of Enterprise Machine Learning
- Forrester Research On Enterprise Data Science
- The 10 Requirements Of Machine Learning Platforms
- The Future Of Machine Learning Platforms
- Introduction To Databricks Customer overstock.com
- Use-Case: Generating User-Scores To Rate Purchase Propensity
- Scaling Data Science At overstock.com On Databricks
- Machine Learning With SPARK
- The Business Impact Of Databricks At Overstock.com
- Here’s more to explore
Thanks for joining us today. My name is Wayne Chan and I’ll be your moderator for today. Happy to present this webinar titled 10 Must-Haves to Deploy Machine Learning and AI in the Enterprise.
I’m very pleased to introduce our speakers today. We’ve got an amazing lineup here to share their experiences and findings around machine learning and artificial intelligence.
First, I’m pleased to introduce Mike Gualtieri. Mike is a vice president and principal analyst at Forrester. He focuses his research on software technology, platforms, and practices that enable tech professionals to deliver prescient digital experiences and operational efficiencies. He specifically specializes in big data and IoT strategy, Hadoop and Spark, predictive analytics, machine learning, data science, AI, and emerging tech. Mike is also the leading expert on the intersection of business strategy, architecture design, and creative collaboration.
And in addition to Mike today, we have a couple folks from Overstock.com. We’ve got Chris Robison, who is the head of marketing data science at Overstock, and Craig Kelly, the group product manager for the group.
Chris is the project lead for the digital marketing and fraud protection efforts at Overstock. He has extensive experience at early-stage startups using Spark and building out data science frameworks and solutions.
Craig, as a group product manager, leads engineering and data science for the marketing group at Overstock and has deep expertise and a philosophy focusing on driving innovation through scalable, extensible systems heavily leveraged to deliver speed to market and velocity of iteration.
Okay, so just a quick review of the agenda today. Mike from Forrester is just going to kick things off. He’s going to talk through some of his research findings around top trends in machine learning and AI and requirements when selecting a platform in the enterprise. And from there, I’ll pass it over to the Overstock guys who will talk through their case study, their use case, the issues that they came across, and why they chose to go with Spark and partner with Databricks to handle their data science efforts. And then also go through what their next steps are. And then we’ll close out with a Q&A session.
With that, I will hand it over to Mike who will go over the first set of slides. And Mike, if you do have any technical issues and aren’t able to forward the slides, just give me a verbal cue and I’ll move it forward for you.
Forrester Research On The State Of Enterprise Machine Learning
Okay, thank you. Well, hello, everyone. My name is Mike Gualtieri, principal analyst at Forrester. I’m really happy to be here to talk to you about the research we’ve been doing for enterprise machine learning platforms.
And I say enterprise because there’s an additional set of requirements that are necessary to satisfy enterprise needs. It’s not just about downloading a Jupyter Notebook to your laptop and going.
We’re talking about machine learning. Everyone’s talking about machine learning, every enterprise is doing machine learning. But I wanted to put it in a little context, in the context of AI, because you often hear people say, “Oh, we’re doing AI and machine learning.” Our view on artificial intelligence is that it’s not one particular technology; rather, it’s comprised of one or more technologies. So to be clear, we’re not saying that you have to use all of these technologies. You don’t have to do physical robotics to actually be AI. But these are just some of the technologies.
Now, by far, the most popular and most used technology for enterprises is machine learning. Now we break out deep learning. It’s definitely a branch of machine learning but we break it out because it’s focused on a particular technique and people generally refer to that technique using neural networks as deep learning. So I’m going to talk to you about all of these things.
Now, our simple definition of machine learning is algorithms that analyze data to find models. These are predictive models, models that can predict outcomes or understand context with significant accuracy and improve that accuracy as more data is available. So machine learning algorithms create models and those models are predictive models.
Now, there are two types of machine learning. There is supervised and unsupervised, and many of you are already well aware of this. But supervised is about these predictive models, and unsupervised is more about finding patterns. I’m going to focus my discussion today on supervised learning, and that is the predictive models, because that’s what enterprises are mostly doing with this stuff.
Machine learning platforms, as we define them, obviously have to use these algorithms but they provide the full lifecycle of tools to create these models. If we put an industry lens on this and we see all right, who’s actually adopting this, it’s definitely across the board. Our data shows that this is a hot topic among all industries. But if you force us to rank it and see who’s at the top, obviously, internet, ecommerce giants are using machine learning on a routine basis. Financial services, insurance, and then as you go down the line, there’s slower adoption.
Again, there’s a lot of adoption in digitally native companies. They’re focused more towards digital companies. It’s a lot easier to do machine learning. If you’re using machine learning in manufacturing, say, an IoT application, well, now you’ve got physical devices involved. It’s, in some ways, more difficult to adopt but it’s generally across the board.
A couple of use cases. The classic for machine learning is to build a predictive model to predict which customers may churn. If you can predict which customers are likely to churn, well, maybe you can do something about it so they don’t churn, resulting in significant savings.
We see retailers using data about their customers to hyper-personalize experiences even in the store, learning individual characteristics, predicting their behaviors, creating recommendation engines. And in the world of IoT and manufacturing, machine learning models can automate decisions by predicting things like quality, predicting demand, and many other aspects of manufacturing processes.
Certainly, machine learning is used to detect fraud and other cybersecurity issues.
It’s being used in connected devices. A lot of connected devices are about gathering information. Certainly, that’s been the IoT trend. Now it’s about hey, what can we do with that data in real time. Again, you need machine learning models to do that.
Life sciences is using it to speed the drug discovery process.
Customer service is trying to predict customers’ problems before they actually occur. Again, a machine learning model.
There’s quite a few use cases out there. Every business has multiple use cases and generally, the way to find that use case is pretty simple. Walk through your business process, walk through an application, and at each step say, “Is there something I could predict here?” Because if there’s something you can predict there, then that’s a candidate for machine learning to potentially create that model.
Forrester Research On Enterprise Data Science
Now, the practice of doing machine learning is called data science. And data scientists, and I’m sure there are many of you on this line, explore data and use these machine learning platforms that we’re going to talk about to create these models, whether they’re about customers, processes, risks, whatever they’re trying to predict.
Data scientists know algorithms. Now, some could argue what’s… Because I’ve listed some statistical algorithms here and data scientists generally use those in conjunction with machine learning during the lifecycle process, but generally data scientists don’t build and write these algorithms. They use them, and many of these are out in the open source, implemented in R’s CRAN, scikit-learn, or Spark MLlib. All of the platforms have to use these algorithms.
But these algorithms aren’t necessarily what distinguishes a good enterprise platform. It also has to focus on making data scientists and those data scientist teams as efficient as possible in this life cycle. And the life cycle roughly goes like this. We need a lot of data sources, we need to prepare that data, we need to run machine learning algorithms to create a model, we need to see if that model is going to work, and then most importantly, we need to get this into production.
Now, unlike software or code that has a distinct function, where it does what you tell it to do, a machine learning model is about probabilities, and the efficacy and the accuracy of those predictions can change over time. So an important step here is being able to monitor those models in production. This is a never-ending, continuous process because you want to retrain those models on new data. That’s what machine learning platforms are all about.
Some of the key challenges that we’re hearing are faced by data science teams today: it’s always about getting the data. If you think about an enterprise, it doesn’t just have a couple of data sources or a few applications. It has dozens, hundreds, sometimes thousands. So for any particular machine learning project, there could be a handful or more data sources that are relevant, and those exist in applications and potentially external sources. So acquiring that data is still a challenge.
And as data grows and as data scientists want to use bigger data sets, now you need to have scalable training, and they struggle with spiky workloads. Again, that comment I made about how in the enterprise it’s not just about downloading a Jupyter Notebook on your laptop and saying go: there are so many inquiries about people running into scalability problems. They have to have a scalable solution in the enterprise.
And it’s fairly time consuming to iterate through all the possible algorithms, all the possible parameters, and all the possible feature prep and other data hypotheses you might have.
And as data science teams grow, maintaining that productivity is also reported as a challenge, and certainly so is deploying and managing those models in production. So those are what we see as the key challenges.
What’s good about this market is that these aren’t surprising challenges to the vendors who offer tools in this space. Most of these vendors in one way or the other are attempting to address these and many have addressed them.
The 10 Requirements Of Machine Learning Platforms
Let’s look at the 10 criteria, the 10 requirements that you should look for in an enterprise machine learning platform.
Number one is data set preparation.
Data is that fuel, it’s the raw material for successful machine learning projects. So any machine learning platform has to have some features for data acquisition, connecting to data sources, and integrating that data, doing some sort of transformation on that data. Now, there’s ways to do that externally and that might be an enterprise solution.
But even that last mile, there’s always some data acquisition. And certainly every data scientist will tell you that you’ve got to create the right features. This isn’t just about bringing in columns but sometimes it’s about creating derived columns, exploring that data statistically to reduce the dimension, reduce the size of the data set.
Even with deep learning where people will say, “Oh, well, the great thing about deep learning is you don’t have to do all that. It extracts the features automatically.” Yeah, but what about the labels? Without labels, you’re not going to have a very accurate model. Some people dismiss that. They shouldn’t. And I put that in this category as well.
Look at the data set preparation. Now, some of the data set preparation is not trivial. Some of it involves sorting, reordering data sets, doing massive transformation. That can just be a massive step in itself before you even get to the good stuff of training the model.
Number two is the algorithms.
Of course, you need algorithms and you need a well-rounded set of algorithms to accommodate different types of data in different use cases. We think most of the innovation in algorithms is now occurring in open source. And that’s great. There are very active communities in the deep learning space and even in the more traditional classical algorithms.
But when you’re looking at a platform, you should also look at, “Okay, they support open source algorithms,” and you want that because you want to take advantage of that innovation that’s happening. But what sort of proprietary algorithms does it add? It could be statistics, it could be a combination of algorithms, it could be an implementation of how you can pass the parameters, it could be an abstraction layer on top of open source. Definitely make sure it has open source but don’t stop there.
And then finally, deep learning. Deep learning is relatively new. It doesn’t seem like it because everyone’s been talking about it for years, but a lot of the platforms just haven’t tackled integrating deep learning and are forcing people to download separate frameworks like TensorFlow or MXNet and do that outside of one of these platforms.
What we’re seeing now, though, is that the open source community and the commercial vendors are starting to bring these frameworks in, abstract them, and make them easier to use.
Number three is scalability.
So we’ve got our data, we’ve got our algorithms, now we have to churn, now we have to be able to analyze that data. I don’t want you to just think about scalability in terms of the size of the data set. That’s critically important as well but I also want to remind you that it’s also about the iterations.
It’s very rare that you’re going to prepare a data set, you’re going to run a random forest against it, get the answer and be happy with it. No. You’re going to change the depth of the random forest or then you’re going to try a GBM.
There’s multiple iterations that occur here. That makes performance important. If you’re going to do a dozen iterations and they each take an hour or two hours, okay, you’re down a day. But if you’re on a scalable platform where those same runs can take five minutes or 10 minutes, you can iterate faster. That means you can be more productive, you can find more accurate models, and it makes an entire team more productive, not to mention, the ongoing retraining that takes place with new data to make sure the model is still accurate.
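To make that arithmetic concrete, here is a minimal sketch in Python. The run count and per-run durations are the ones mentioned above; the function name is just for illustration, and it assumes the runs happen one after another.

```python
# Figures from the talk: a dozen training runs per round of experiments.
RUNS = 12

def total_hours(runs: int, minutes_per_run: float) -> float:
    """Wall-clock hours needed to run `runs` experiments back to back."""
    return runs * minutes_per_run / 60

# Slow platform: each run takes one to two hours.
slow_low = total_hours(RUNS, 60)
slow_high = total_hours(RUNS, 120)

# Scalable platform: each run takes five to ten minutes.
fast_low = total_hours(RUNS, 5)
fast_high = total_hours(RUNS, 10)

print(f"slow: {slow_low:.0f} to {slow_high:.0f} hours")
print(f"fast: {fast_low:.0f} to {fast_high:.0f} hours")
```

The same dozen runs drop from somewhere between half a day and a full day to an hour or two, which is the difference between one round of experiments per day and several.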
Now, you really have to… Every vendor is going to say, “Of course, we’re scalable.” But a lot of the classic vendors in this space, there was no sense of cluster computing for this sort of thing. It was always… Their scalability was multiple threads, multiple cores. It wasn’t sort of a cluster compute environment. That’s where you’re going to get the scaling.
Many of the vendors, and I’m tracking 47 vendors here, especially the new ones, basically get that scalability from Apache Spark because it has MLlib. You also have H2O on there, and there are other libraries as well.
Number four. Open source.
Most of the innovation that is occurring in machine learning algorithms is happening in the open source community, and data scientists want to take advantage of that. First is polyglot programming, and that just means, “I want to be able to use multiple languages.” So no, it’s not just about R. No, it’s not just about Python. It’s about both, and potentially Lua and other languages.
So what open source programming languages does that solution support? And then second, big data. There’s lots of innovation starting with Hadoop and Spark. There are other platforms on the cloud, for example, that handle big data. So what are the connections to big data? And remember, it’s also for data prep, not just for scalability and training.
And then I’ve already mentioned the algorithms. How can those open source algorithms be exposed? It’s important because it’s really messy. Messy, messy, messy out there with all these different frameworks. So to what extent can the vendor let you use a new framework quickly but perhaps abstract away some of the complexity of some of the more well-known libraries.
Number five. The workbench.
Machine learning platforms have to provide the tools that those data scientists and their collaborators in the business use across that entire model lifecycle. When you’re looking at a platform and the tools they provide, look at the UI tools and what the differentiated features are.
I talk to a lot of data scientists. They want to be in Jupyter, some want to be in RStudio. Some even want to be in Apache Zeppelin from a notebook solution. But there’s a whole other set of data scientists who are thinking about, “Hmm, what about the drag and drop model for data pipelines?” There’s Apache Airflow, there’s some other projects coming up in the open source that add to those paradigms.
Even if you are using a notebook solution, well, what are those differentiated features? We’ll talk about those a little bit more, things like collaboration features and lifecycle features.
And then the other thing that’s becoming quite popular is automation. That lifecycle diagram I showed you? That circle that just keeps going? Well, at each stage in that circle, there’s other circles. And so the idea is, are there any tools in there that can automate building models?
A good example of that is what I mentioned before where I may want to try three different algorithms or I may want to try one algorithm with 12 different sets of parameters, do I have to do that one at a time? Or is there a way to configure it? Are there some tools that let me essentially configure it for all these runs and just say go and it runs all of them? Does the [inaudible 00:21:59] at the end? So look for automation.
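As a rough illustration of "configure all the runs and say go," here is a minimal sketch in plain Python. Everything in it is hypothetical: the search space, the `train_and_score` placeholder, and the made-up scores stand in for real training runs on whatever platform you use.

```python
from itertools import product

# Hypothetical search space: 3 algorithms x 4 depths = 12 configured runs.
ALGORITHMS = ["random_forest", "gbm", "decision_tree"]
DEPTHS = [2, 4, 8, 16]

def train_and_score(algorithm, depth):
    """Placeholder for a real training run; returns a made-up validation score."""
    base = {"random_forest": 0.70, "gbm": 0.72, "decision_tree": 0.65}[algorithm]
    # Toy rule: more depth helps up to 8, then hurts (mimicking overfitting).
    return round(base + 0.01 * min(depth, 8) - 0.005 * max(depth - 8, 0), 4)

def run_all(algorithms, depths):
    """Run every configured combination and rank the results, best first."""
    results = [
        {"algorithm": a, "depth": d, "score": train_and_score(a, d)}
        for a, d in product(algorithms, depths)
    ]
    return sorted(results, key=lambda r: r["score"], reverse=True)

leaderboard = run_all(ALGORITHMS, DEPTHS)
best = leaderboard[0]
print(len(leaderboard), "runs; best:", best)
```

The point is the shape, not the scores: the data scientist declares the combinations once and gets back a ranked list instead of launching twelve runs by hand.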
You will see a lot of vendors talking about some of those automation capabilities but even at that we’re at the early stages of that. It’s going to be much bigger I think in the next 12 to 18 months.
Number six is collaboration.
Now, this is a key one in the enterprise because what’s happened… First off, enterprises have been doing machine learning for a very, very long time. But it’s been in very niche areas in the business.
It’s funny when an insurance company says, “Oh, we want to do data science.” You’ve been doing it for 20 years. But what they mean is they want to create models across more functions of the business, and that means they want to create data science teams, and once you’re working in a team, you need to maintain that productivity. It’s collaboration among the data scientists themselves but also among those business stakeholders that have to sign off on these models as well.
Look for sharing tools, what sort of annotation capabilities, certainly sharing code, sharing models, sharing everything having to do with that lifecycle, and then also what sort of community exists.
There are open source communities, but what I’m talking about here is whether there is a way to have a community function, or even integration with things like Slack, the collaboration tools that companies already use. So look for that as well.
Number seven is deployment.
Obviously, these models have to find their way into an application, into a business process, for them to actually start making predictions and making a difference to the business outcomes. You hear a lot of people saying, “Oh, that’s the hardest thing.” Well, there’s a technology component to that and there’s also an organizational component, but I’m going to talk about the technology component.
A model is expressed usually in some form of code, some form of runtime. So you have to look at what deployment methods are available. Is it a service call? That sounds like the simplest possible solution, but if you’re doing a prediction inside, say, a transaction detecting fraud, well, you don’t want to make a network hop. Why? Because you don’t have extra milliseconds to do that, so you may want to embed it in the code, you may want to embed it in a database, embed it in other applications. So look for multiple ways of deploying that model.
Now, that’s going to be partially governed by the algorithms and what they output but increasingly, you’ll see these platforms have ways of creating a, for lack of a better term, a jar-like component that you can then incorporate in other applications.
Number eight, and this is my absolute favorite because it’s so underappreciated, I think, which is model management.
And fine, you have the model in production. Now what? Well, we know that models have a lifespan because they’re based upon the past, they’re based upon historical data, circumstances can change. So model management is a production capability.
Model management has three key components to it. The first one is model monitoring. You have to look in a platform. What capabilities does it have? Or what capabilities can you add to your code or add to the model to monitor that model efficacy during the deployment? So what do I mean by that?
Well, in model training, you may say, “Hey, cool, I’m getting a 79% accuracy.” And that’s totally cool. But what if that accuracy starts to degrade? Well, how do you know it degrades? Or how do you know it made the right decision? The right recommendation? Say, it’s a recommendation for a product. Well, if a lot of users start clicking on that product, that’s pretty good evidence that it’s working. If they’re not, maybe it’s not working. But model monitoring is a key capability.
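Here is a minimal sketch of that kind of monitoring in Python. It assumes you eventually learn the true outcome for each prediction (for example, whether the user clicked). The class name, window size, and tolerance are all hypothetical choices; only the 79% baseline comes from the example above.

```python
from collections import deque

class AccuracyMonitor:
    """Tracks accuracy over the most recent `window` predictions with known outcomes."""

    def __init__(self, baseline, tolerance, window):
        self.baseline = baseline      # accuracy measured at training time
        self.tolerance = tolerance    # how far below baseline is still acceptable
        self.outcomes = deque(maxlen=window)  # True where the prediction was right

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        acc = self.accuracy()
        return acc is not None and acc < self.baseline - self.tolerance

# Assumed settings: 79% training accuracy, alert if we fall more than 5 points below.
monitor = AccuracyMonitor(baseline=0.79, tolerance=0.05, window=100)

# Simulated production feedback: 70 correct predictions, then 30 wrong ones.
for i in range(100):
    monitor.record(predicted=1, actual=1 if i < 70 else 0)

print(monitor.accuracy(), monitor.degraded())
```

A rolling window is one simple design choice; it lets recent behavior dominate, so a model that was fine last month but is drifting now still trips the alert.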
Now, even before you get a model into production, you better not trust it and most mature organizations don’t. No matter how rigorous the training of that model was by data scientists, you got to be out of your mind to just pop it in there to replace a model that you know is working.
Most mature organizations will do some sort of a champion/challenger or A/B testing of the model. Champion/challenger just means, “Okay, I’ve got a model out there. It’s working. I’ve got a new one I think is super better. But I’m nervous.” So that becomes the challenger and you’re running the production data through both in parallel and you’re comparing the results. If the challenger does better than the champion, then the challenger becomes the champion.
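A minimal sketch of champion/challenger in Python might look like the following. The two toy models, the `visits` feature, and the four records are invented for illustration, and the sketch assumes a non-empty set of production records whose true outcomes are already known.

```python
def champion_challenger(champion, challenger, records):
    """Score both models on the same labeled production records.

    Only the champion's predictions would be acted on; the challenger runs
    alongside it so the two can be compared without any business risk.
    """
    hits = {"champion": 0, "challenger": 0}
    for features, actual in records:
        hits["champion"] += champion(features) == actual
        hits["challenger"] += challenger(features) == actual
    accuracy = {name: count / len(records) for name, count in hits.items()}
    # The challenger is promoted only if it is strictly better.
    winner = "challenger" if accuracy["challenger"] > accuracy["champion"] else "champion"
    return accuracy, winner

# Hypothetical toy models: predict purchase (1) from a single "visits" feature.
def current_model(features):
    return 1 if features["visits"] >= 5 else 0

def new_model(features):
    return 1 if features["visits"] >= 3 else 0

records = [
    ({"visits": 1}, 0),
    ({"visits": 3}, 1),
    ({"visits": 4}, 1),
    ({"visits": 6}, 1),
]

accuracy, winner = champion_challenger(current_model, new_model, records)
print(accuracy, winner)
```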
A/B testing is a little bit different. It’s like, “I’m going to use both these models but I’m scared so I’m only going to do 5% of the decisions or the predictions on this new model. I’m going to make sure it’s okay. And then I’ll replace it.” This is an important feature of model management.
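One common way to send a fixed slice of traffic to the new model is to hash a stable identifier into buckets. This is a sketch under assumptions: the function name, the `user-N` ids, and the bucket count are hypothetical; only the 5% share comes from the example above.

```python
import hashlib

def assign_model(user_id: str, challenger_share: float = 0.05) -> str:
    """Deterministically route a fixed share of users to the new model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000  # stable bucket in the range 0..9999
    return "new" if bucket < challenger_share * 10_000 else "current"

# Simulate 20,000 hypothetical users and check the observed split.
assignments = [assign_model(f"user-{i}") for i in range(20_000)]
share_new = assignments.count("new") / len(assignments)
print(f"share on new model: {share_new:.3f}")
```

Hashing rather than picking at random means the same user always lands on the same model, which keeps their experience consistent and makes the results easier to analyze afterwards.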
Then finally, model lineage. Model lineage really speaks to some of the other requirements like the workbench and the data prep. How did I create this model? Can I recreate the model? You have to… It’s very common for people to say, “Oh, you have to explain to regulators.” Well, you do if you’re in a regulated industry, but don’t dismiss it if you’re not, because you might have to explain to an executive how this data was created, whether we’re using the right data, and whether this model is good. Model lineage will trace and audit the data and methods used to do that as well.
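As a rough idea of what a lineage record could hold, here is a minimal Python sketch. The field names and the `lineage_record` helper are hypothetical, not any platform's actual format; the idea is simply to fingerprint the training data and keep it next to the algorithm and parameters.

```python
import hashlib
import json

def lineage_record(training_rows, algorithm, params):
    """Capture enough about a training run to audit or repeat it later."""
    # Serialize deterministically so the same data always yields the same hash.
    data_bytes = json.dumps(training_rows, sort_keys=True).encode("utf-8")
    return {
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "row_count": len(training_rows),
        "algorithm": algorithm,
        "params": dict(params),
    }

# Hypothetical training data and settings.
rows = [{"visits": 3, "purchased": 1}, {"visits": 1, "purchased": 0}]
record = lineage_record(rows, "random_forest", {"depth": 8})

# Same inputs give the same record; changed data gives a different fingerprint.
same_again = lineage_record(rows, "random_forest", {"depth": 8})
changed = lineage_record(rows + [{"visits": 9, "purchased": 1}], "random_forest", {"depth": 8})
print(record == same_again, record["data_sha256"] != changed["data_sha256"])
```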
Number nine is business solutions.
There are a lot of common use cases, and I went through a few use cases with you at the beginning here for machine learning models, so why reinvent the wheel? So if you’re doing a churn model, you’ll see that with some vendors there’ll be sample code or there might be some sort of starter projects that are for specific use cases. It’s not an absolute necessity but it’s a nice-to-have.
If you have specific vertical and horizontals in mind, look for the potential of solution accelerators. And these, like I said, can be sample starter models, prebuilt models. A lot of these types of things like if you’re an R programmer, you probably say, “Well, I get that in the CRAN.” Well, sure, but it might be easier if they’ve been expressed in a platform as an ML platform.
Data repositories, increasingly… This is rare. This is rare these days. But increasingly, some of these platforms will specifically have knowledge about specific types of datasets, specific formats from, say, ERP systems or other public sources. You might not find much on that but look for it. We think there’ll be a lot more platforms offering that.
And finally, number 10. What about the vendor?
I told you we’re tracking 47 vendors in this space. It’s getting confusing out there. The last nine things I went through, I think, will help you evaluate and whittle down some of the vendors. But when you’re looking just at the vendor, how can I acquire this?
Increasingly, a lot of people want a cloud option, and they want the cloud option because it speaks to scalability, auto provisioning, deployment, a lot of the other… It’s relevant to a lot of other things. But on the other hand, some want different pricing models. So look at the pricing, see if it works for your situation.
There’s a lot of tiny, tiny vendors in this space. A lot of them actually base some of their code on the… Not open source, but you have to look at the ability of the company to execute on its strategy and that’s particularly relevant because there are so many new vendors in this space and then how can they support the entire platform as well. That’s always an important consideration for an enterprise with the vendor.
And then finally, look at that roadmap. The space is moving at a breakneck pace. What is this product roadmap and is it going to meet the future needs? I mentioned deep learning. Okay, well, we’re not doing deep learning now but we want to do it in six months. Is that even on the vendor’s roadmap? Do they even have an answer for that? Product roadmap, critically important, as well.
The Future Of Machine Learning Platforms
Those are the 10. And then wrapping up here, just a couple of final thoughts on what the future holds really for 2018 what we think are the hottest items in terms of machine learning platforms and one is cloud.
I already mentioned that, but more and more enterprise buyers want cloud either as the primary or at least as an option. They’re very hesitant to consider on-prem-only options. And you can see the larger vendors have all been scrambling over the last 18 months to provide cloud solutions that are something more than just a wholesale move of their existing code, or even re-architecting for the cloud. And certainly most of the younger startups have designed for the cloud, though not all.
And then the second big trend is this automation. It was popular a couple of years ago and sometimes, it’s popular now. It’s like, “Oh, we want to allow non-data scientists to build models.” That’s the solution for the shortage of data scientists. Well, that’s more than just a fancy user interface. You actually have to hide some serious details. Pick a business user and you can give them all the drag and drop tools in the world. But then if they’re faced with picking one of a hundred algorithms to train the model, they don’t know.
Automation is two things here. One, it’s about hiding some of those details so maybe more business intelligence type professionals can build some sorts of models. They’re not going to replace data scientists. And then the other part of automation is saying, “Hey, what if we just provide automation that makes the data scientists we have a thousand times more productive?” We’re seeing a big trend towards this automation.
I think you’ll have access to these slides. So certainly, these are the 10 things that I look for when I’m evaluating and following these vendors in the marketplace and there’s a lot more detail that we could go into, but these are the 10 primary.
With that, I thank you for your attention and now I’m going to turn this presentation over to Chris Robison, who is data science lead at Overstock, and his colleague, Craig Kelly, who’s the product manager at Overstock.com. Chris, Craig.
Introduction To Databricks Customer overstock.com
Thank you very much, Mike.
So as Mike said, my name is Chris Robison. I’m the lead data scientist at Overstock working in marketing and fraud and I’m going to share with you today a little bit of our story about a particular use case we’ve been pursuing and the path that brought us towards Databricks as a solution.
First, a little bit about us.
Overstock.com is the premier online destination for furniture and home décor using technology to help savvy shoppers find the best prices on top sales to create their dream homes. We were founded in 1999, we have nearly 5 million unique products for sale across 160 different countries, and most important for myself as a data scientist and my team, we have billions and billions of visits and page views that have accumulated over the years.
Mike, in the early portion of his talk, alluded to data as being the fuel for all these algorithms. At Overstock, we’ve been stockpiling our fuel for nearly two decades and this is really a pretty incredible data set. We’ve actually been able to watch people go through entire lifestyle changes of, say, shopping for bedroom furniture for their toddlers to shopping for dorm room furniture for their new college students. So this gives us just a wealth of information that we can mine and harvest.
Let’s talk a little bit about the problem at hand.
Use-Case: Generating User-Scores To Rate Purchase Propensity
Our initial goal was to identify a propensity to purchase or essentially generate what we call a user score. So we’re identifying unique patterns and tendencies that indicate a user is ready to purchase or convert and the user score represents a customer’s likelihood of purchase.
We collect data at the visit and user level, and the basic steps that we’ll go through throughout these slides are these. We first turn the raw user interactions into features; this is that first item of data prep that Mike was alluding to. We then train classifiers on months of data with a label of purchase versus no purchase, and the end goal is to predict on new users and visits as they come in.
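To picture the first step, here is a minimal Python sketch of rolling raw events up into one labeled row per visit. The event format, event names, and feature names are all hypothetical; the actual Overstock features are not described in the talk, and at their scale this would run on Spark rather than in memory.

```python
from collections import defaultdict

# Hypothetical raw events: (user_id, visit_id, event_type).
events = [
    ("u1", "v1", "page_view"),
    ("u1", "v1", "add_to_cart"),
    ("u1", "v1", "purchase"),
    ("u2", "v2", "page_view"),
    ("u2", "v2", "page_view"),
]

def build_visit_rows(events):
    """Roll raw events up to one feature row per visit, with a purchase label."""
    rows = defaultdict(lambda: {"page_views": 0, "cart_adds": 0, "purchased": 0})
    for user_id, visit_id, event_type in events:
        row = rows[(user_id, visit_id)]
        if event_type == "page_view":
            row["page_views"] += 1
        elif event_type == "add_to_cart":
            row["cart_adds"] += 1
        elif event_type == "purchase":
            row["purchased"] = 1  # the label: purchase versus no purchase
    return dict(rows)

visit_rows = build_visit_rows(events)
for key, row in visit_rows.items():
    print(key, row)
```

A classifier would then be trained on the feature columns with `purchased` as the label, and applied to new visits as they arrive.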
What are some of the challenges we ran into along the way?
The first challenge we ran into is a class imbalance challenge. This is very typical of predictive classifiers in eCommerce.
So what does this mean? Essentially, we have an imbalance. So most of our sessions don’t actually end in an actual purchase. So essentially, we have much, much less of one label than another label and they can’t give explicit numbers but we say pretty consistent with the general trends in eCommerce.
So we have many new customers, we have billions of unique page views in any calendar year, as I said, many of the users we’re seeing for the first time, we have low conversion so a small percentage of sessions actually end in a purchase, and this very sparse weblog data means we have to digest just an enormous amount to generate useful features and we need some way and some scalable way to comb over all of these weblogs and start to roll things up to usable features for algorithms.
And at the end of the day, we’re interested in accuracy on the positive labels. Recall over precision. We want to… In a greedy fashion, I could make a very simple algorithm that says, “Okay, well, no one’s going to purchase and we have accuracy in the high 90s at the session level.”
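To make the point about accuracy versus recall concrete, here is a minimal sketch in plain Python. The 3% conversion rate and the session counts are made-up illustration values, not Overstock figures.

```python
def accuracy(y_true, y_pred):
    """Fraction of sessions whose label was predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual purchases that the model caught."""
    positives = sum(1 for t in y_true if t == 1)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives if positives else 0.0


# Hypothetical data: 1,000 sessions, 30 of which end in a purchase.
y_true = [1] * 30 + [0] * 970

# The "no one is going to purchase" model.
always_no = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, always_no)
baseline_recall = recall(y_true, always_no)
print(baseline_accuracy, baseline_recall)
```

The trivial model scores very high on accuracy while finding none of the purchasers, which is why recall on the positive label is the metric to watch here.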
The second challenge we ran into, and Mike addressed this really nicely throughout his portion of the talk, is just computational expense. Batch training and ETL takes an enormous amount of resources.
In my feature data set, at some point, I have to take all of the sessions meaning individual trips to our website for each user and order them by time. This is the single most resource intensive operation out of any of my applications. For example, some bots will have more than a million sessions for a single user.
Or inside of a session for some of these bots, we’ll see clicks that are going faster than any human could possibly click a mouse. To add a layer of complexity on top of this, during our peak times our traffic can reach 10x of normal flow, so our normal resources are being stretched to their limits at the exact same time that my ETL processes need the most resources and even training needs the most resources.
Resources are scarce and stretched and this results in data scientists spending too much time on DevOps and not enough time iterating on their models.
In the back of your minds, you should be thinking, “We need some sort of scalable solution. And how about Databricks?”
The third challenge, and Mike addressed this with his point on polyglot language support, is just multiple programming languages. There’s a wide range of programming language preference amongst my data scientists. The top three are probably Python, R, and Scala. And the language choice is really predicated on use case.
You want to use the most efficient language to prototype models and algorithms on smaller data sets. Oftentimes, especially for data scientists coming out of school, that’s Python or R. Then you move to a more horizontally scalable language for production, for which we’ve chosen the route of Scala, although there are a lot of options here.
Another layer on this is, in the exploration phase, you want a language that’s very visually rich. You want to be able to ask different questions or address different hypotheses about your data sets, be able to visualize your problems and the solutions to those problems, and then be able to share those around with your business partners. So for exploration, analysis, and feature engineering, you need a robust statistical visualization framework like R or like Matplotlib in Python.
So the result oftentimes is a large organization where everyone is using their own notebooks, whether they be Jupyter Notebooks or Zeppelin Notebooks. I think there was a flash in the pan of a Scala notebook for a little while. But these create silos across the data science organization and cause a lot of repeated code. At the end of the day, I never want my scientists to have to write a custom package to do some exploratory analysis if I can write it once and then add it to a larger code base so that they can just pull down those modules, use them for their task at hand, and then move on to modeling.
Scaling Data Science At overstock.com On Databricks
This brings us to data science at scale with the Databricks unified analytics platform.
How did we conquer some of these challenges I talked about?
The first being the class imbalance challenge. We took a greedy approach at first that ended up working out really well and we just simply oversampled our training data. There are more nuanced approaches you can take with synthetic oversampling, hierarchical sampling. We took a first approach and it ended up working out well enough for our purposes.
We have a Spark job running on Databricks which splits the ETL training data. Since the positive labels are very sparse, small percentage of visits, we find all the positive labels and a relatively equal number of negative examples to train on. You can think of this split as a hyper parameter that you can tune. Do you want 70/30? 60/40? You can play around with this as you’re going through your model exploration process and it greatly increases your accuracy metrics.
In a sense, we’re making our models overly sensitive to these positive labels that we want to identify.
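As a rough illustration of treating the class split as a tunable hyperparameter, the sketch below keeps every positive row and samples negatives to reach a requested share. It is plain Python rather than the Spark job described in the talk, and the function name, the `positive_share` parameter, and the sample data are all hypothetical.

```python
import random


def balance_training_set(rows, positive_share=0.5, seed=0):
    """Keep every positive row and sample negatives to hit positive_share.

    rows: list of (features, label) pairs, label 1 = purchase, 0 = no purchase.
    positive_share: desired fraction of positives in the result,
    for example 0.5 for a 50/50 split or 0.3 for a 30/70 split.
    """
    if not 0 < positive_share < 1:
        raise ValueError("positive_share must be strictly between 0 and 1")
    positives = [row for row in rows if row[1] == 1]
    negatives = [row for row in rows if row[1] == 0]
    # Number of negatives needed so positives make up positive_share.
    wanted = round(len(positives) * (1 - positive_share) / positive_share)
    rng = random.Random(seed)
    sampled = rng.sample(negatives, min(wanted, len(negatives)))
    balanced = positives + sampled
    rng.shuffle(balanced)
    return balanced


# Hypothetical visits: 1 in 50 ends in a purchase.
rows = [({"visit": i}, 1 if i % 50 == 0 else 0) for i in range(1000)]
balanced = balance_training_set(rows, positive_share=0.5)
print(len(balanced), sum(label for _, label in balanced))
```

Re-running with `positive_share=0.4` or `0.3` is the "60/40 or 70/30" experiment mentioned above. The evaluation set should stay at the natural class ratio, otherwise the reported metrics will look better than production behavior.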
The next challenge we ran into was the feature design and ingesting these raw weblog data. So each action on a site for each individual is a single entry in our weblog whether you’re clicking a button, choosing a color swatch, changing between different options for some of the product, maybe different colors in couches. Each one of those interactions is a single entry.
So first, we combine all of these events into multiple sessions for each user. You should be thinking along the lines of a combineByKey in Spark. And we then combine all of the sessions for each user.
So for each user, I’m going to collect a sorted set, say, of all of the sessions for that user, and then at some point we’re going to want to order all of those sessions so that we can generate empirical counts of specific actions and interaction times. So these are lag features in the classical time series sense.
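The two-step roll-up can be sketched on a single machine as follows. This is a plain-Python analogue, not the production Spark code: Spark's RDD `combineByKey` also takes a third function that merges partial results across partitions, which is omitted here, and the event tuple layout is an assumption made for illustration.

```python
def combine_by_key(pairs, create_combiner, merge_value):
    """Single-machine analogue of a combine-by-key roll-up."""
    combined = {}
    for key, value in pairs:
        if key in combined:
            combined[key] = merge_value(combined[key], value)
        else:
            combined[key] = create_combiner(value)
    return combined


# Hypothetical weblog rows: (user_id, session_id, timestamp_seconds, action).
events = [
    ("u1", "s1", 100, "view"),
    ("u1", "s1", 130, "add_to_cart"),
    ("u2", "s9", 105, "view"),
    ("u1", "s2", 900, "view"),
]

# Step 1: events -> sessions, keyed by (user_id, session_id).
sessions = combine_by_key(
    (((user, session), (ts, action)) for user, session, ts, action in events),
    create_combiner=lambda event: [event],
    merge_value=lambda acc, event: acc + [event],
)

# Step 2: sessions -> all sessions per user, keyed by user_id.
user_sessions = combine_by_key(
    ((user, (session, evts)) for (user, session), evts in sessions.items()),
    create_combiner=lambda sess: [sess],
    merge_value=lambda acc, sess: acc + [sess],
)

print(user_sessions["u1"])
```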
How long ago did individual X come to this site? Did they look at similar taxonomies? How long were they on this site? Are they showing us different behavior than their typical just window shopping behavior?
This brings up again our most expensive action, which is ordering all of the user sessions. Once all the sessions for each user are collected, we sort the sessions by time. Again, this was extremely expensive. Then we start calculating time lags between sessions and individual actions and we make histograms out of the more important actions.
At what time of the day do our users tend to add to cart? To remove from cart? At what time of the day are they viewing different taxonomies or different subcategories? We’re going to try and embed those patterns into our feature set.
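For one user, the sort-then-lag step looks roughly like the sketch below. Timestamps are assumed to be seconds since some epoch, and the sample values are made up.

```python
def session_lags(session_starts):
    """Sort session start times and return the gaps between neighbours."""
    ordered = sorted(session_starts)
    lags = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return ordered, lags


# Hypothetical session start times for one user, in arrival order.
starts = [86_400, 0, 3_600, 90_000]
ordered, lags = session_lags(starts)
print(ordered)
print(lags)
```

The sort is what makes this expensive at scale: a user (or bot) with a million sessions needs all of them brought together before the lags can be computed.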
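A time-of-day histogram for a single action can be built as in this sketch. The `(timestamp, action)` event shape and the use of UTC hours are assumptions for illustration; a real pipeline would likely use the shopper's local time.

```python
from collections import Counter
from datetime import datetime, timezone


def hourly_histogram(events, action):
    """Count occurrences of `action` per hour of day (0-23, UTC)."""
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour
        for ts, name in events
        if name == action
    )
    return [counts.get(hour, 0) for hour in range(24)]


# Hypothetical events: (unix_timestamp_seconds, action_name).
events = [
    (9 * 3600, "view"),
    (20 * 3600 + 60, "add_to_cart"),
    (20 * 3600 + 900, "add_to_cart"),
    (86_400 + 21 * 3600, "add_to_cart"),
]
histogram = hourly_histogram(events, "add_to_cart")
print(histogram)
```

The 24 bins can be fed to a model directly as features, one histogram per action of interest.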
Essentially, we’re just trying to capture the difference between normal shopping behavior and behavior that indicates a customer is ready to purchase.
So when do you usually shop? For me, a lot of the time, I may be browsing on my cell phone while I’m at work, while I’m bored in a meeting, but I tend to convert later on in the evening and oftentimes, I actually switch devices when I convert. I’m more likely to convert on my tablet or my laptop simply because I have bad eyesight and oftentimes, I can’t see that resolution on my cell phone screen.
We want to embed into our feature set enough information that algorithms can answer the questions, when do people usually shop? When do they purchase? What device do they prefer in each case? When are they going to start removing from carts? Are they a customer that doesn’t use our wish list functionality but they tend to just pile up a bunch of items in their carts until they’re ready to convert and then they start removing those items to get down to a price point that they’re comfortable with?
And then we look at lagging windows of all of these features. We look at say, one, seven, 14, and 30-day windows of above and these lagging windows are, again, a hyperparameter that you can play around with if you have a robust enough code base.
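The lagging windows can be expressed as one small function with the window lengths as a parameter, which is what makes them easy to tune. This is a sketch in plain Python; the feature naming and the dates are hypothetical.

```python
from datetime import date, timedelta


def windowed_counts(event_dates, as_of, windows=(1, 7, 14, 30)):
    """Count events in the last `n` days up to and including `as_of`."""
    features = {}
    for days in windows:
        start = as_of - timedelta(days=days)
        features[f"count_{days}d"] = sum(
            1 for d in event_dates if start < d <= as_of
        )
    return features


# Hypothetical visit dates for one user.
as_of = date(2018, 3, 31)
visits = [
    date(2018, 3, 31),
    date(2018, 3, 28),
    date(2018, 3, 20),
    date(2018, 3, 5),
    date(2018, 1, 1),
]
features = windowed_counts(visits, as_of)
print(features)
```

Passing a different `windows` tuple, for example `(3, 10, 60)`, is all it takes to try another set of look-back periods.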
So once we take all of the sessions for each user and start to order them, we want to pull in external data sets, whether it be census data or just other datasets that we have at Overstock. We have a priority club membership, we can look at returns data, we can look at interactions with our customer service.
For any given session in time, how long ago did this user make a return? Are they a priority club member? If they are a priority club member, how long ago did they start? Have they changed their password recently? If they did interact with our customer service team, was that a successful interaction? Did they get a resolution in a timely manner?
We want to embed all of this global information about our users into these feature sets. To do this, we use the Spark Snowflake connector with query pushdown, which makes these really complicated joins and aggregations efficient across millions of users per day.
I really can’t drive home this point enough. For each given session in time, we’re attaching this global information in a proper way. So a user may become a Club O member, but then stop their Club O membership and we need to embed that and attach it to the correct session. You can picture just these massive, massive joins across millions and millions of rows and these large data sets.
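Attaching the membership status that was true at the time of each session is a point-in-time (as-of) lookup. The sketch below shows the idea for one user with a binary search; it is plain Python, not the Snowflake pushdown join used in production, and the status labels are hypothetical.

```python
from bisect import bisect_right


def status_as_of(history, when, default=None):
    """Return the most recent status whose timestamp is <= `when`.

    history: list of (timestamp, status) pairs sorted by timestamp.
    """
    times = [ts for ts, _ in history]
    index = bisect_right(times, when)
    return history[index - 1][1] if index else default


# Hypothetical membership changes for one user: joined at t=100, left at t=500.
membership = [(100, "member"), (500, "cancelled")]
session_times = [50, 100, 300, 700]

labelled = [
    (t, status_as_of(membership, t, default="never_joined"))
    for t in session_times
]
print(labelled)
```

Using the status as of the session time, rather than the user's current status, keeps future information from leaking into the training features.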
Here’s a diagram of what our general ETL pipeline looks like and again, we’re doing all of this from prototype to deployment in Databricks. So we take our raw web logs, we use a combineByKey to roll those up into sessions, we then use another combineByKey to generate user sessions or attach all of the sessions for each user.
At that point, we have to flip the switch and order all the sessions by time. And again, this is very, very expensive. Once we have all of the sessions ordered by time, we can join in the global information to generate our enhanced user profiles. So priority club membership, returns, customer service, anything that we can really pull in.
Then we can generate temporal and lag features. So how long ago did someone make a return? Are they a frequent returner? Do we have priority club information about this user? And that, at the end of the day, generates an enhanced user profile that we ingest into our algorithm as a final feature set.
Machine Learning With SPARK
Now on to the model training side of life. Again, we do all of our model training in the Databricks unified analytics platform. We actually wrote a custom module in Spark that allows us to cross validate three or really K different algorithms for any given task.
An example may be using logistic regression, random forests, and naive Bayes. We then use Spark hyperparameter tuning to find the optimal parameter set for each of these algorithms and we provide custom evaluator classes to Spark so that we can choose the metric which Spark then chooses as the best parameter combination for a single algorithm. And at the end of the day, we want to cross validate across all of these algorithms.
We not only want to choose the best hyperparameter set for each of these algorithms but we want to choose the optimal algorithm across the three or the five or the 20 algorithms that you may have plumbed in for any given problem. This is getting more towards that automation and scalability. There’s no silver bullet for any of these tasks and so why not try all of the algorithms that are provided to you? Why not try customizing some of those algorithms once you’ve done your initial training runs and you have a better idea of your space?
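The nested search, best parameters per algorithm and then best algorithm overall, under a pluggable evaluator, can be sketched as below. This is not the custom Spark module from the talk: the two toy "algorithms", the grid, and the data are hypothetical stand-ins chosen so the example runs with the standard library alone.

```python
from itertools import product


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs, assigning row i to fold i % k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


def recall(y_true, y_pred):
    """Custom evaluator: fraction of actual positives that were predicted."""
    positives = sum(y_true)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives if positives else 0.0


def cross_validate(fit, params, X, y, evaluator, k=3):
    """Mean evaluator score of one algorithm/parameter pair over k folds."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train], **params)
        preds = [model(X[i]) for i in test]
        scores.append(evaluator([y[i] for i in test], preds))
    return sum(scores) / len(scores)


def select_model(candidates, X, y, evaluator, k=3):
    """Return (score, algorithm_name, params) for the best combination."""
    best = None
    for name, (fit, grid) in candidates.items():
        keys = sorted(grid)
        for values in product(*(grid[key] for key in keys)):
            params = dict(zip(keys, values))
            score = cross_validate(fit, params, X, y, evaluator, k)
            if best is None or score > best[0]:
                best = (score, name, params)
    return best


# Toy algorithms standing in for real learners.
def fit_threshold(X, y, cutoff):
    return lambda x: 1 if x >= cutoff else 0


def fit_majority(X, y):
    majority = 1 if 2 * sum(y) >= len(y) else 0
    return lambda x: majority


X = list(range(12))
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
candidates = {
    "threshold": (fit_threshold, {"cutoff": [10, 8, 6]}),
    "majority": (fit_majority, {}),
}
best = select_model(candidates, X, y, recall)
print(best)
```

Swapping `recall` for another function with the same `(y_true, y_pred)` signature changes what "best" means without touching the search loop. Recall on its own rewards a model that predicts a purchase for everyone, so in practice the evaluator would likely combine it with precision or a cost term.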
At the end of the day, we need to have really robust reporting and transparency into these automated processes. At the end of each algorithm run and each cross validation of each algorithm, we produce a report that gives us a high level model description. What version of the model am I using? What’s the name of the model? What are the features that are being fed into the model? Where it’s appropriate, what are the coefficients for those features?
What are all of the accuracy metrics? And by all of the accuracy metrics, I mean you want to literally output everything that’s available to you. That’s where we, as scientists, have to look at those combinations of metrics and decide how to make our future iterations and keep fine tuning these models.
And then we want parameter descriptions, default runtime settings. Am I overflowing the memory on some of my nodes? Is my job running longer than it has in the past? That might mean maybe we’re seeing a spike in traffic, maybe there’s a leak in a process somewhere but you want as much visualization and introspection as possible.
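One possible shape for such a per-run report is a JSON-friendly dictionary, as sketched below. The field names, model name, and every number are hypothetical; the talk does not describe the actual report format.

```python
import json


def build_report(name, version, features, coefficients, metrics, params,
                 runtime_seconds):
    """Collect everything about one training run into a JSON-friendly dict."""
    return {
        "model": {"name": name, "version": version},
        "features": list(features),
        # Coefficients only exist for some model families.
        "coefficients": (
            dict(zip(features, coefficients)) if coefficients else None
        ),
        "metrics": dict(metrics),
        "parameters": dict(params),
        "runtime": {"seconds": runtime_seconds},
    }


report = build_report(
    name="user_score_logreg",
    version="0.3.1",
    features=["visits_7d", "cart_adds_30d"],
    coefficients=[0.8, 1.4],
    metrics={"recall": 0.71, "precision": 0.22, "auc": 0.86},
    params={"reg_param": 0.01, "max_iter": 100},
    runtime_seconds=1840,
)
report_json = json.dumps(report, indent=2, sort_keys=True)
print(report_json)
```

Keeping the report machine-readable makes it straightforward to compare run times and metrics across runs and to spot a job that suddenly takes longer than it used to.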
Here’s a diagram of what our model training flow looks like. Again, all built on Databricks. We featurize 30 plus days, we feed those features into a splitter, we then train a model that we’re going to ship off to some sort of prediction harness, and then we refeaturize on single days of data or single hours of data, whatever your time unit is there depending on the problem and make predictions.
The Business Impact Of Databricks At Overstock.com
So why Databricks for unified analytics?
The first thing for us especially in marketing, especially in eCommerce, is speed to product. Databricks allowed us to close the gap between POC and production. My data scientists are POCing algorithms in notebooks on scalable data sets that we can then ship off to more production environments in a drag and drop way.
- We decrease the cost of moving models to production by nearly half.
- We stand up new models at one-fifth of the time previously required.
- We can make intraday improvements on existing models without new deploys.
- We quickly spin up and down clusters through self-service and cluster management.
- This means actionable insights when our business partners actually need them.
- And the in-notebook version control and collaboration allows us to rollback single moves inside of a notebook.
I can log in and look at one of my scientist’s notebooks, help them debug their code, and then roll back versions if we maybe try something that doesn’t work, and all of that is preserved. There’s strong Git integration and it makes exploration and the general trial and error approach to exploratory analysis totally seamless.
The next big thing is elastic compute. This addresses those spikes in workflows that Mike was talking about.
Elastic and scalable compute allows for fast iteration during model development. It shortens the time to complete exploratory analysis, which traditionally can take tens of hours, if not days. You can sink a lifetime into exploring extremely large data sets.
The server-less solutions available allow for efficient use of our cloud resources for non-mission critical analysis. Maybe I have some crazy hypothesis that I want to explore on a very large data set. That’s going to require me to comb through all of the weblogs for a couple of months. I can deploy that job on a server-less solution that’s going to run in the background in a really cost efficient way because I don’t need the answer right now. I’m fine with getting the answer in a few days or even a week.
And simply put, the maturity in networking, security, and distributed computing on Databricks on the larger AWS platform is second to none.
At the end of the day, we just want to pick the right tool for the job. Python, and especially Python 3, and R are much more robust in Databricks. Databricks supports both Python 2 and 3, so it takes care of that backwards compatibility issue that I’m sure a lot of you are running into if you started out as a Python 2.7 shop.
There’s a full suite of libraries for both environments and they can be installed on the fly. So again, notebooks can switch between Python 2 and 3 inside of a notebook. I can do some data visualization in R, ship it off to Python to use their ETL and manipulation, and then do some large scale things in Scala. You can do it cell to cell.
We can install Python and R libraries as we need them on the fly, which enables us to prototype new libraries without requesting external support. It’s a point and click operation for individual scientists. We also have the ability to push our internal code base to the notebook clusters, which allows for customization and less code reproduction for common tasks. So we’re getting away from silos of individual notebooks and individual code bases. We’re all building off of each other’s work.
They provide legacy support for languages and frameworks. This gets back to that model management point that Mike was really harping on, and I can completely agree that it’s often one of the most overlooked points for a robust data science pipeline and one of the hardest to achieve. So we have full support of historic versions of Spark, Scala, and Python, which allows for complete reproducibility of models based on older versions.
We’re able to snapshot full datasets and connect them to models, allowing results to be completely reproducible at any time in the future, and this takes care of issues like models changing. Do we want to roll back to something that we tried six months ago and see if that model is still relevant? So you have completely plug and play functionality. And the combination of backwards compatibility with reproducible results makes for a robust long term data science and modeling environment.
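One lightweight way to tie a model to the exact data and library versions it was trained with is a manifest containing a content hash of the dataset. The sketch below is a generic illustration in plain Python, not a description of how Databricks snapshots data; the manifest fields, model name, and version strings are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Hash a dataset deterministically so a model can reference it."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def model_manifest(model_name, model_version, rows, library_versions):
    """Record what is needed to rebuild the same model later."""
    return {
        "model": model_name,
        "version": model_version,
        "dataset_sha256": dataset_fingerprint(rows),
        "row_count": len(rows),
        "libraries": dict(library_versions),
    }


# Hypothetical training rows and version pins.
rows = [
    {"user": "u1", "visits_7d": 3, "label": 1},
    {"user": "u2", "visits_7d": 0, "label": 0},
]
manifest = model_manifest(
    "user_score", "0.3.1", rows, {"spark": "2.3", "python": "3.6"}
)
print(json.dumps(manifest, indent=2))
```

If a result ever needs to be reproduced, recomputing the fingerprint on the stored snapshot confirms that the data has not changed since the model was trained.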
The end goal should be completely reproducible models and results which means that any mistake can be reversed and work lives on long after my data scientists moved on to their next opportunity.
Our experience with Databricks has been very rich. They’ve become a partner in innovation and success. The data scientists and mostly our business partners at Overstock are naturally greedy individuals. We want all of the data all of the time and all in near real time. The account reps and engineering teams at Databricks welcome this challenge and they’ve really stepped up to the plate for us. They’re incredibly responsive to our specific needs.
I can count numerous times where I’ve had both our account reps and support reps on the phone within minutes of running into a problem helping us troubleshoot in real time on our production systems. They continually push new features whether it’s an API to produce containerized models that can be deployed on scalable production servers or multiple actors support for simultaneous reading and writing built on top of Spark.
However you look at it, Databricks is invested in our success and excited to push the boundary of what’s currently possible and it’s developed into a really rich relationship that allows us to build the future faster.
Again, speed to product is the name of the game in our industry. We have to be on the leading edge of our competitors and delivering our customers what they expect. We achieve intraday iterative modeling, multiple language support, a unified analytics platform. We do everything from ETL to exploratory analysis to deep learning. My product managers even use the Databricks notebooks to help visualize KPIs and turn around business reports.
Elastic compute allows us to use the latest tech with the latest hardware all the time. It’s instantly scalable with powerful automation features. Strong collaboration tools have really improved our productivity; they allow us to spend more time thinking deeply about the models we’re producing and they allow for transparency. Git integration allows for robust code tracking, and all of this combines to allow us to achieve an N+1 data model in the cloud.
Again, we can take all of our internal data sources, publicly available data through the census, and then data through our other third party vendors and congeal it all in one space.
I think one of my data scientists put it best. Working on our new cloud stack is like getting a seat in first class. It’s just the way flying or data sciencing should be.
Thank you very much. I’ll now turn this back over to Wayne for some Q&A.
All right. Thanks, Chris.
Quick question here for the Overstock guys. Can you talk a little bit more about how you move development to production on Databricks?
A lot of the… I guess it all starts, in terms of development, in the notebook, where we have our data sets available through Snowflake in the cloud. We can start to POC algorithms on smaller data sets, maybe a week or a couple of weeks’ worth of weblog data, and once we start getting some juice and seeing the metrics we want, it’s pretty seamless to then deploy those, whether we’re then wrapping them in Scala in a jar and deploying as natural Java on Databricks, to run over weeks and months of data and start to monitor them in real time.
It makes it deceptively easy to move from POC into production and close that last mile.
Right. Thanks, Chris. Thank you again to our presenters for making the time to share their findings and experiences and thanks everyone for attending.