This is a guest post from Chris Robison, Head of Marketing Data Science at Overstock.com.
At Overstock.com we’ve never had a problem with a lack of data. At 19 years old, we have one of the most expansive user datasets in all of e-commerce. As Lead Data Scientist in Marketing, I can look back through our user records and see people progress from having toddlers to having college students. Through buying patterns, we can see individuals transition from furnishing their first flat to furnishing their dream home. This means we have TONS of data, and processing it at scale requires some serious firepower: cue Databricks and a managed, auto-scaling, cloud-based Spark infrastructure.
The Problem to Solve
The initial problem we were trying to tackle is simple enough on the surface: how to quantify when an individual is close to making a purchase. In other words, how do we generate a score for each customer that represents their likelihood to purchase? With nearly 5 million products for sale onsite and billions of visits and page views in our historic web logs, sifting through the massive amounts of sparse data to construct adequate features and identify useful signals in a never-ending stream of cart interactions, page views, and product attribute selections proved to be an enormous computational challenge; it’s the type of dataset and computation that DevOps nightmares and Data Science dreams are made of.
The live event stream coming from the site is made up of individual log entries for every action a customer performs. The distinct actions then need to be rolled up to individual web sessions for each user. This sounds simple enough; however, even this step can be challenging. Overstock.com has billions of unique page views in a calendar year, and we are seeing many of the users for the first time. Moreover, as with any e-commerce company, only a small percentage of sessions end in a purchase. Thus, we face a severe class-imbalance problem and favor recall over precision in our evaluation metrics.
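The session roll-up logic can be sketched outside of Spark. Below is a plain-Python illustration, not our production code: it assumes a hypothetical 30-minute inactivity cutoff (the post does not state the actual threshold) and starts a new session whenever a user's consecutive hits are farther apart than that.

```python
from datetime import timedelta

# Hypothetical inactivity threshold; the real cutoff is an assumption.
SESSION_GAP = timedelta(minutes=30)

def sessionize(hits):
    """Roll one user's raw log entries into sessions.

    `hits` is a list of (timestamp, action) tuples for a single user.
    Hits are sorted by time, and a new session begins whenever the gap
    since the previous hit exceeds SESSION_GAP.
    """
    sessions = []
    current = []
    prev_ts = None
    for ts, action in sorted(hits):
        if prev_ts is not None and ts - prev_ts > SESSION_GAP:
            sessions.append(current)  # close the previous session
            current = []
        current.append((ts, action))
        prev_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

In production this kind of roll-up would run as a distributed Spark job (e.g., with window functions keyed by user ID), but the per-user logic is the same.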
Our next step is to combine individual sessions into collections of sessions for each customer, ordered by time. This sort proves to be the most expensive operation in our chain of apps: some bots generate more than 1 million sessions for a single unique ID, and sorting forces all 1 million sessions into the memory of a single worker on the cluster, overloading that worker and killing our jobs. Moreover, during peak times our traffic can reach up to 10 times normal flow. This creates a perfect storm of resource consumption: on-site resources become stretched thin at the same time that our jobs are consuming record amounts of resources for ETL and model training. This presents another massive challenge: no matter how amazing my models are, site uptime is always the most important factor in choosing what to deploy. Thus, at the times when we have the most opportunity to leverage our models, those models have the highest chance of being killed due to lack of resources. The result is Data Scientists spending too much time on DevOps and not enough time iterating on their models.
In order to deliver on the promise of a personalized shopping experience, we chose to partner with Databricks as our unified analytics platform to more efficiently build, train and deploy machine learning models.
Figure 1: Data flow at Overstock.com
The first problem we tackled was feature generation and ETL. We addressed the bot-driven memory overload through exploratory analysis of typical user behavior, then filtered the initial data pulls accordingly. Basically, we filtered out actions that were happening so fast that they couldn’t reasonably be attributed to a human (i.e., must be bot traffic). Once we trimmed out these very large session counts, we successfully sorted the sessions for each customer and started to generate features such as histograms of hit times, time between sessions with specific cart actions, product views, etc.
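A minimal sketch of that bot filter, in plain Python: flag an ID whose typical gap between actions is faster than a human could click. The one-second cutoff and the `looks_like_bot` helper are hypothetical; the post does not give the actual threshold or implementation.

```python
from datetime import timedelta
from statistics import median

# Hypothetical cutoff: a median gap under 1 second between hits is
# treated as non-human. The real threshold is an assumption.
MIN_HUMAN_GAP = timedelta(seconds=1)

def looks_like_bot(timestamps):
    """Flag an ID whose actions arrive faster than a human plausibly could.

    Uses the median inter-hit gap so a few slow hits don't mask a bot.
    """
    ts = sorted(timestamps)
    if len(ts) < 2:
        return False  # not enough activity to judge
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return median(gaps) < MIN_HUMAN_GAP
```

In the real pipeline this filter would run as part of the Spark ETL before sessions are grouped per customer, which is what keeps the million-session IDs out of the sort.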
Figure 2: ETL Workflow in Databricks
Our goal was to identify a large collection of features that could be used to answer these questions: when do you usually shop, when do you purchase, and what device do you prefer to use in each case. Through our exploratory analysis of the session data we noticed some interesting trends in shopping behavior. People tend to window shop during the day while they are at work. However, they tend to wait until the evening to make purchases, especially large purchases. Moreover, our customers tend to shop on mobile devices, but convert on desktop.
Ultimately, our goal was to identify typical shopping behavior from purchasing behavior. To achieve this goal we utilized 1, 7, 14, and 30-day look-back windows for all of our empirical features and histograms. This generated a rich signal that showed changes in behavior as a customer moves closer and closer to purchase. With some initial testing and cross-validation we realized that the web-log behavior could be enriched by identifying what a customer’s state was for any specific session, i.e., for a given session in time how long ago did this user make a return? Are they a priority club member? If so, how long ago did they join the club, or cancel their membership? However, the joins between our sessionized data and various customer information tables presented another set of computational issues. We leveraged Spark’s Snowflake connector with query pushdown to make these complicated joins and aggregations efficient across millions of users per day.
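The look-back windows described above can be illustrated with a small plain-Python helper: for a given reference time, count a user's events in trailing 1-, 7-, 14-, and 30-day windows. This is a simplified sketch of the feature logic only (the production version runs in Spark and also builds histograms); the function name is hypothetical.

```python
from datetime import timedelta

# Look-back windows from the post: 1, 7, 14, and 30 days.
LOOKBACKS = [1, 7, 14, 30]

def lookback_counts(event_times, as_of):
    """Count events in each trailing window ending at `as_of`.

    Returns a dict like {"1d": n1, "7d": n7, "14d": n14, "30d": n30};
    comparing these counts over time surfaces shifts in behavior as a
    customer moves toward a purchase.
    """
    return {
        f"{d}d": sum(
            1 for t in event_times
            if as_of - timedelta(days=d) <= t <= as_of
        )
        for d in LOOKBACKS
    }
```

The same windowed counts would be computed for each event type (cart adds, product views, etc.) to build the full feature vector.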
Once the features are generated, model training using Spark on Databricks proves to be very straightforward. We enhanced Spark’s cross-validation method to allow for cross-validation and hyper-parameter tuning inside each of three algorithms, then picked the optimal algorithm and parameter set out of the three. This nested cross-validation method allowed us to test several models at each retraining and constantly promote the best performing model based on new data. Next, we leveraged the multi-language support in Databricks to generate rich reports and visualizations describing each retraining and execution of our model. Reports include high-level model descriptions and versioning, visualizations of classifications and underlying data distributions, classical statistical tests for changes in those distributions, metrics, parameter descriptions, and default runtime settings.
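Spark ML ships its own `CrossValidator`, but the selection logic can be illustrated in plain Python. This is a simplified sketch, not our production code: each candidate is a (name, train-function) pair, each is scored by mean recall across k folds (recall, given the class imbalance noted earlier), and the top scorer is promoted. The helper names are hypothetical.

```python
from statistics import mean

def recall(predict, test):
    """Fraction of true positives the model catches (favored over
    precision because purchases are rare)."""
    tp = sum(1 for x, y in test if y == 1 and predict(x) == 1)
    pos = sum(1 for _, y in test if y == 1)
    return tp / pos if pos else 0.0

def cross_validate(train_fn, data, k=3):
    """Mean recall over k folds: train on k-1 folds, score the held-out one."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(recall(train_fn(train), held_out))
    return mean(scores)

def select_best(candidates, data, k=3):
    """Promote the best-performing candidate, mirroring the
    pick-one-of-three step described above."""
    return max(candidates, key=lambda c: cross_validate(c[1], data, k))[0]
```

In production each `train_fn` would be a full Spark ML pipeline with its own hyper-parameter grid, and the winner would be promoted at every retraining.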
Figure 3: Model training workflow with Databricks
The most important attribute of any solution is speed to product. Databricks allowed us to close the gap between proof of concept (POC) and production. The last mile of productionizing models at scale is the most painful part of traditional deployments. Databricks allows us to POC and productionize models all in the same environment on the same datasets. Most importantly, our Data Scientists can do both in R, Python, or Scala, allowing for true flexibility across a variety of libraries and toolkits in addition to native Spark. We saw the following benefits:
- Decreased cost of moving models to production by nearly 50%
- Stand up new models in one-fifth the time previously required
- Allows for intra-day improvements on existing models without new deploys
- Quickly spin up/down clusters through self-service cluster management, giving us actionable insights when our business partners need them
- In-notebook version control allows us to roll back individual changes inside a notebook, making exploration and a general trial-and-error approach to exploratory analysis seamless
Elastic compute and scalable resources allow for fast iteration during model deployment. We can test out ideas and feature sets across multiple clusters. This shortens exploratory analysis that could traditionally take tens of hours, if not days. Serverless solutions allow for efficient use of cloud resources for non-mission-critical analysis: we run less critical jobs in the background while production jobs retain the full computational resources they require.
Databricks allows us to choose the right tool for the job. Python (especially Python 3) is much more robust in Databricks than in other notebook environments. Various legacy jobs run in perpetuity with full historic support of all three major languages. The full suite of Python and R libraries is available to install on the fly without requesting external support. One of the most attractive features for an established Data Science team is the ability to push our internal code base to the notebook clusters, eliminating code reproduction and allowing for full customization.
Ultimately, Databricks is a partner for Overstock in innovation and success. The Data Scientists at Overstock are naturally greedy individuals. We want all the data, all the time, and all in near real-time. The account reps and engineering teams at Databricks welcome this challenge. Throughout our relationship, they have been incredibly responsive to our specific needs and are continually pushing new features/updates that integrate seamlessly with our existing workflows. Mary Clair Thompson, Data Scientist at Overstock said it best:
“Working on our new cloud stack is like getting a seat in first class. It’s just the way flying (or data science-ing) should be.”