The Critical Missing Component in the Production ML Stack

May 26, 2021 03:50 PM (PT)

The day the ML application is deployed to production and begins facing the real world is the best and the worst day in the life of the model builder. The joy of seeing accurate predictions is quickly overshadowed by a myriad of operational challenges. Debugging, troubleshooting and monitoring take over the majority of their day, leaving little time for model building. In DevOps, software operations are taken to the level of an art. Sophisticated tools enable engineers to quickly identify and resolve issues, continuously improving software stability and robustness. In the ML world, operations are still largely a manual process that involves Jupyter notebooks and shell scripts. One of the cornerstones of the DevOps toolchain is logging. Traces and metrics are built on top of logs, enabling monitoring and feedback loops. What does logging look like in an ML system?

In this talk we will demonstrate how to enable data logging for an AI application using MLflow in a matter of minutes. We will discuss how something so simple enables testing, monitoring and debugging in an AI application that handles TBs of data and runs in real-time. Attendees will leave the talk equipped with tools and best practices to supercharge MLOps in their team.

In this session watch:
Alessya Visnjic, CEO, WhyLabs, Inc.



Alessya Visnjic: Hello everyone. Thank you for coming to this session of the Data and AI Summit. My name is Alessya Visnjic, I am the CEO of WhyLabs, the AI Observability Company. Today I would like to present an open source project that fills the huge current gap in the production machine learning stack. Here is what I would like to cover in the next 20 minutes. We’ll start with looking at the machine learning stack and discuss what’s missing. Spoiler alert, there are many things missing from the production machine learning stack, but today we’ll focus on what I believe is the most critical missing component, data logging.
I’ll present ideas on how to design a logging solution for data and machine learning applications. Then I’ll present an open source library that implements these designs and show you how to plug this library into Spark and MLflow. Finally, I’ll talk about the most common use case this library addresses. We’ll conclude with a Q&A. Let’s go.
ML applications are delightful when they work and dreadful when they don’t work. If you have ever supported a machine learning application before, you probably have a story of how it failed and upset your customers, just like Andreas here, who got recommended a pepper as a substitute for roses by the Whole Foods algorithm. It’s a harmless failure, but a frustrating customer experience. What’s even worse, just imagine what it takes for the AI practitioner on call to troubleshoot this behavior.
Debugging such undesirable model behavior is so frustrating for AI builders, it makes you feel like you’re trapped in a Kafka novel. Furthermore, even the best and most accurate machine learning applications have pitfalls. Every and any model struggles in the wild, because the data that represents the world is always changing. To quote one of my favorite decision scientists, Cassie from Google, “The world represented by your training data is the only world you can expect to succeed in.” So no matter how good your model is, one day you’ll find yourself troubleshooting and debugging it.
Now, debugging ML applications is unlike debugging traditional software. Building and serving machine learning models involves processing massive amounts of data. The entire Databricks ecosystem is specifically designed to power terabyte size data pipelines. If we look at the four main steps in the ML stack on the slide, processing raw data, generating features, training the model and serving the model, each one of these steps involves transforming and moving massive amounts of data. Therefore, each step can either introduce a data bug or it can be completely derailed by a data bug.
So the majority of machine learning problems come from data, from things like missing data, duplicate data, bad quality data and data drifts. Now, think back to the first time or the last time you debugged a poorly performing machine learning application. You check the code, the model version, the hyper-parameters, the infrastructure, only to realize that the problem is actually in the data. So how do you debug data? How do you test it? How do you monitor it? How do you document it?
This is not a novel problem by any means. Every engineer who operates machine learning applications in production has their own improvised way of testing, monitoring and debugging data. It’s probably ad hoc, it likely involves pulling a lot of data from storage, writing complex ETL queries, processing this data and writing visualizations in your favorite notebook. This is an incredibly tedious process.
Imagine if instead of running this tedious manual process every time your model fails, there was a way to continuously monitor the quality of data flowing through each stage of the machine learning stack. Imagine if you could be alerted about data issues such as data drifts, schema changes, data outages and so on before they derail and impact your model. Today, I would like to present an elegant solution that makes the lives of machine learning practitioners easier.
This solution streamlines your debugging, testing and monitoring activities, and the solution is data logging. Similar to application logs or infrastructure logs, each step in the machine learning stack should generate a data log that captures metadata and statistical properties of each batch of the data processed over time. Let’s say we wanted to design such a data log. What would this data log have to capture in order to provide the most utility to the AI practitioner?
In my experience, working with dozens of machine learning teams, there are five core categories that are most crucial to track for understanding how good your data is. For example, tracking metadata allows you to understand where different subsets of your data are coming from and how fresh they are. Tracking counts allows you to ensure that the data volume is healthy and allows you to keep an eye on missing values. Tracking summary statistics allows you to identify outliers and data quality bugs.
Tracking the distributions enables you to understand how the shape of your data changes over time and to catch distribution drifts. And finally, you can optionally also track a small stratified sample of the raw data to help with debugging or post hoc exploratory analysis.
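To make these categories concrete, here is a minimal Python sketch of a one-pass profile for a single numeric feature. It is an illustration only, not the whylogs implementation: the function name `profile_column`, the dictionary layout and the fixed `bin_edges` are all assumptions, and the optional raw-data sample is left out.

```python
import math
import time

def profile_column(values, bin_edges, source="unknown"):
    """Build a small data-log entry for one numeric feature in a single pass."""
    count = 0
    missing = 0
    total = 0.0
    minimum = math.inf
    maximum = -math.inf
    histogram = [0] * (len(bin_edges) - 1)
    for v in values:
        count += 1
        if v is None:
            missing += 1
            continue
        total += v
        minimum = min(minimum, v)
        maximum = max(maximum, v)
        # Half-open bins [lower, upper); values outside all bins are not counted
        for i in range(len(histogram)):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                histogram[i] += 1
                break
    present = count - missing
    return {
        "metadata": {"source": source, "logged_at": time.time()},
        "counts": {"total": count, "missing": missing},
        "summary": {
            "min": minimum if present else None,
            "max": maximum if present else None,
            "mean": total / present if present else None,
        },
        "histogram": histogram,
    }

profile = profile_column([1.0, 2.5, None, 4.0, 9.5], bin_edges=[0, 5, 10], source="batch-001")
```

The profile holds only metadata and aggregates, so its size depends on the number of bins rather than on the number of rows in the batch.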
Now, since logging is going to be part of every step in the machine learning pipeline, we need to be thoughtful about the runtime properties and the outputs of this logging system. So what are the key properties of a good and useful logging solution? Well, here we can take some cues from DevOps best practices. A good logging solution should be lightweight: it should run in parallel with the main data workloads. A good logging solution should be portable: it should be easy to plug in anywhere in your pipeline.
A good logging solution should be configurable. You should be able to configure what it captures based on your use case very easily. And since we’re working with data statistics, the logs should also be mergeable, so that logs from multiple instances can be combined in distributed workloads and logs can be aggregated over time into hourly or daily batches.
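The mergeable property can be illustrated with a small Python sketch. It keeps a running count, mean and sum of squared deviations, and combines two partial summaries with the standard pairwise formula from Chan et al. This is a generic technique chosen for illustration; the `Summary` class and its fields are hypothetical and are not taken from whylogs.

```python
from dataclasses import dataclass

@dataclass
class Summary:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def update(self, x):
        # Welford's one-pass update for a single new value
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Pairwise combination of two independent partial summaries
        n = self.count + other.count
        if n == 0:
            return Summary()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return Summary(n, mean, m2)

    @property
    def variance(self):
        # Population variance; 0.0 for an empty summary
        return self.m2 / self.count if self.count else 0.0

def summarize(values):
    s = Summary()
    for v in values:
        s.update(v)
    return s

# Two workers each summarize their own partition, then the results are merged
part_a = summarize([1.0, 2.0, 3.0])
part_b = summarize([4.0, 5.0])
merged = part_a.merge(part_b)
whole = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
```

Merging the two partial summaries gives the same count, mean and variance as summarizing all of the data in one place, which is what makes this kind of log usable across distributed workers and across time windows.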
Well, after many years of supporting production machine learning applications, my team and I at WhyLabs set out to build a data logging solution that satisfies all the criteria I just described. Today, I’m thrilled to present whylogs to you. whylogs is a purpose-built machine learning logging library, open sourced by our team at WhyLabs. The library is built to provide lightweight, portable, configurable and mergeable data logs for both batch and streaming data workloads. Check out the Bitly link on the right side to get to the GitHub package.
whylogs is already making the lives of AI practitioners all over the world much easier. It’s available in Python, Java and Scala under the Apache 2.0 license. Now let’s look at how whylogs works and how it satisfies the design requirements we discussed earlier. There are many, many open source libraries that tackle the problem of data quality testing. Packages like Deequ and Great Expectations are very popular. Now, whylogs takes a very different approach from those packages. It implements a standard format for representing a snapshot of data.
This allows the user to decouple the process of producing data logs and the process of acting upon them. whylogs provides a foundation for profiling, testing data quality, monitoring, and so on in a completely decoupled way. whylogs elegantly integrates with Spark because we are working with massive data workloads. In the particular example we’re looking at here, we’re taking a popular public dataset about wine quality and profiling each feature with whylogs. All of this is done in a few lines of code, and behind the scenes, whylogs uses stochastic streaming algorithms to sketch key statistics of each feature.
whylogs seamlessly integrates with the Spark dataframe API, which allows it to offer a setup with little to no configuration for the user. The whylogs UDFs are implemented in Java, taking full advantage of Spark’s raw processing power. whylogs requires only one pass over the data to calculate all of the statistics, so the computation is efficient enough that you can run whylogs in parallel with your main data workloads. The resulting output can be stored in Parquet for post-processing and analytics.
Now, for even more efficient computation, whylogs can also be set up to run as a Spark accumulator. This allows users to just watch a data stream without having to double scan it. This setup can be easily hooked up into existing Spark workflows. Furthermore, taking advantage of the PySpark API bridge, you can enable access to this amazing performance from Python environments. And this setup is especially convenient for data scientists’ workflows, which are typically done in Python.
Now, if you’re already using MLflow to manage the lifecycle of your model, you’re already storing valuable artifacts or logs associated with each run, things like model parameters or performance metrics. whylogs will add a data log artifact to each job. Similar to other MLflow tracking APIs, whylogs profiles and collects the data in just one line of code. Using whylogs with MLflow adds data transparency across the entire machine learning life cycle. The resulting artifacts can be visualized in the MLflow UI or they can be visualized using the whylogs built-in visualizations.
Furthermore, the resulting artifacts can be used to continuously monitor data quality and debug data by running something like a quick exploratory data analysis on the data in every run.
Here’s a sample partial output of whylogs on a wine quality dataset. whylogs captures key statistics per feature, things like counts, distinct counts, top key values, summary statistics and detailed histograms. The default configuration infers the schema and calculates about 30 metrics per feature. The log files are stored in Parquet, protobuf, JSON or flattened CSV, as we see here in this table. This data log contains all the information you need to understand the quality, integrity and distributions of the data in your pipeline at every batch. The type of metrics whylogs captures is configurable and we would love the community to contribute new metrics and ideas.
Now, collecting whylogs systematically enables fantastic transparency into the data that flows through the ML pipeline. In the wine quality example, I used MLflow to run model inference for 20 batches. I used whylogs to capture the data properties of every feature in the batch. Now we can visualize the distribution of, for example, free sulfur dioxide in the slide across all the batches, across the 20 batches I captured. Visualizing the distribution over time helps us understand distribution drifts or identify data bugs.
Here in this particular example, we can see that there’s a spike in the middle right of the graph. And if this data was crucial to my model, I might want to take a look at this particular batch of data to make sure that there are no problems with it.
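A spike like this can also be flagged automatically from per-batch statistics. The Python sketch below applies a median absolute deviation rule to the batch means. It is only one possible rule, chosen here for illustration: the `flag_spikes` helper, the threshold and the example numbers are hypothetical and do not come from the talk or from whylogs.

```python
import statistics

def flag_spikes(batch_values, threshold=3.0):
    """Return indices of batches that sit far from the median batch value."""
    median = statistics.median(batch_values)
    # Median absolute deviation: a spread estimate that a single spike cannot inflate much
    mad = statistics.median([abs(v - median) for v in batch_values])
    if mad == 0:
        return []
    return [i for i, v in enumerate(batch_values) if abs(v - median) / mad > threshold]

# Hypothetical per-batch means of one feature across ten batches
batch_means = [30.1, 29.8, 30.5, 30.0, 29.9, 30.2, 45.0, 30.3, 29.7, 30.1]
suspicious = flag_spikes(batch_means)
```

Any batch index returned here is a candidate for a closer look, in the same way the spike on the slide would be.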
Now, one of the most important properties of a logging setup is to be lightweight and hence cost-effective. Capturing data statistics should not involve moving massive amounts of data and post-processing it. whylogs uses stochastic streaming algorithms to capture remarkably lightweight data statistics. This approach ensures a constant memory footprint, as we can see here in these benchmarks, and an ability to log terabytes of data without breaking the bank. The output scales with the number of features.
Furthermore, the whylogs profiles are mergeable, so for streaming pipelines, you can capture micro batches and aggregate them into hourly or daily batches, very convenient. The resulting output is extremely lightweight and privacy-preserving because it does not store any raw data, it only stores data statistics.
Now, if the logs are used to catch distribution drifts, capturing the distribution accurately is incredibly important. For large scale systems, it has been common practice to sample a percentage of data and calculate distributions as a post-processing step. This is obviously not an accurate method, especially for non-normal distributions. whylogs profiles 100% of the data to estimate the distribution, and to show you the difference in accuracy, we compare the errors that result from calculating a distribution with a profiling approach versus calculating the distribution on sampled data.
Here, you can see a comparison in errors. As you can see, the errors from profiling are significantly lower than those generated from sampled data. Furthermore, distributions captured by whylogs can then be used to set up a stratified sampling approach to capture a more faithful sample of raw data for debugging purposes if you need to do that.
Let’s go back to the initial conundrum of testing, monitoring, debugging and documenting data. We developed whylogs initially to capture the data necessary for monitoring data quality, data drifts and model performance. When we open sourced the library, the AI community found additional ways of using the whylogs outputs. For example, whylogs can be used to unit test your data. To do that, first, you generate a whylogs profile on the batch of data that you would like to use as a baseline or as a reference set. Say you’d like to use training data, then you extract constraints from this whylogs output.
And these constraints can then be applied in GitHub Actions to check new data at every commit. During every commit, a whylogs profile will be generated on a new set of data and you can use GitHub Actions to unit test every feature. This is especially useful for unit testing complex data transformation pipelines, which are one of the most common sources of data bugs.
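The baseline-then-constraints workflow can be sketched in a few lines of Python. This is a simplified illustration and does not use the whylogs constraints API: the profile dictionaries, the `extract_constraints` and `check` helpers and the `slack` parameter are all assumptions.

```python
def extract_constraints(baseline, slack=0.1):
    """Derive simple range and missing-rate constraints from a baseline profile."""
    span = baseline["max"] - baseline["min"]
    return {
        "min": baseline["min"] - slack * span,
        "max": baseline["max"] + slack * span,
        "max_missing_rate": baseline["missing"] / baseline["count"] + slack,
    }

def check(profile, constraints):
    """Return a list of violations; an empty list means the batch passes."""
    violations = []
    if profile["min"] < constraints["min"]:
        violations.append("min below allowed range")
    if profile["max"] > constraints["max"]:
        violations.append("max above allowed range")
    if profile["missing"] / profile["count"] > constraints["max_missing_rate"]:
        violations.append("too many missing values")
    return violations

# Hypothetical profiles of one feature: training baseline and two new batches
baseline = {"count": 1000, "missing": 10, "min": 0.0, "max": 10.0}
good_batch = {"count": 200, "missing": 4, "min": 0.5, "max": 9.0}
bad_batch = {"count": 200, "missing": 80, "min": -5.0, "max": 9.0}

constraints = extract_constraints(baseline)
good_result = check(good_batch, constraints)
bad_result = check(bad_batch, constraints)
```

In a CI job, a test would assert that `check(...)` returns an empty list for every feature, so a commit that introduces a data bug fails the build.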
Now, the most common first use case is to monitor training-serving skew, which is a classic machine learning challenge. To set that up with whylogs, first, you log the training data with whylogs, and that training data will be used as a reference or a baseline. Then you integrate whylogs to capture data during the inference step of your model in hourly batches or 10-minute batches, depending on the volume of data your inference takes. Each log file captured during inference can then be compared to the baseline that was established by the training data log file. Data drift can then be identified by calculating KL divergence or Hellinger distance between the training distributions of each feature and the inference distributions of each feature.
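Both distance measures are straightforward to compute once each feature's distribution is available as a histogram over shared bins. The Python sketch below assumes that shared binning and adds a small smoothing constant `eps` so that empty bins do not cause a division by zero. The helper names and the example counts are hypothetical.

```python
import math

def normalize(counts, eps=1e-9):
    """Turn histogram counts into probabilities, smoothing empty bins."""
    smoothed = [c + eps for c in counts]
    total = sum(smoothed)
    return [c / total for c in smoothed]

def kl_divergence(p, q):
    """KL(p || q) in nats; asymmetric and unbounded."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def hellinger(p, q):
    """Hellinger distance; symmetric and bounded between 0 and 1."""
    return math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q)))

# Hypothetical histograms of one feature over the same four bins
training = normalize([50, 30, 15, 5])
inference_ok = normalize([100, 60, 30, 10])   # same shape, more rows
inference_drift = normalize([5, 15, 30, 50])  # mass moved to the upper bins

kl_ok = kl_divergence(training, inference_ok)
kl_drift = kl_divergence(training, inference_drift)
h_ok = hellinger(training, inference_ok)
h_drift = hellinger(training, inference_drift)
```

A monitor would compare these values against a threshold per feature. Hellinger distance is often the easier one to threshold because it stays between 0 and 1.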
To sum up, more and more data science teams are using whylogs to enable operational machine learning activities. whylogs has integrations that make it easy to plug it into any Spark pipeline, any Python or Java environment and any of the popular machine learning frameworks with just a few lines of code. Capture whylogs profiles during the raw data ingestion step, during feature transformations, during model training and during model serving, everywhere in your machine learning stack. By systematically capturing and storing such log files, you create a record of data quality and data distributions along the entire pipeline.
By comparing log files from different places in the pipeline, you can build data unit tests, data quality monitoring solutions, performance monitoring and debugging dashboards. Any log file can be compared, across all of the features it captures, to another log file from anywhere in the pipeline. This elegant open source library does not involve changing the architecture of the current workflow or changing the workflow that the data scientists use every day. Just a few lines of code add powerful transparency and auditability to your AI applications and enable a wide range of MLOps activities.
Check out the Bitly link to the right of the slide to get to the GitHub package. Try whylogs and be amazed at how much easier it will make your day-to-day ML operations. Please send us your feedback, contribute, help us build integrations into your favorite data tools and help us extend the concept of logging to new data types. Currently, whylogs works for structured data and images, and we have a big roadmap of other data types to work on this year. Join us in the effort of building the new open standard for data logging. Thank you.

Alessya Visnjic

Alessya Visnjic is the CEO and co-founder of WhyLabs, the AI Observability company on a mission to build the interface between AI and human operators. Prior to WhyLabs, Alessya was a CTO-in-residence ...