Application performance monitoring (APM) has become the cornerstone of software engineering, allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications built using machine learning, traditional APM quickly becomes insufficient to identify and remedy the production issues encountered in these modern applications.
As a lead software engineer at New Relic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found that the architectural principles and design choices underlying APM were not a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
Speaker 1: Hello everyone, and thanks for joining me today. My name is Cory and I’m an engineer with Verta. I recently moved from the world of APM into machine learning operations and monitoring, and I wanted to tell you a story about what I’ve learned along the way.
I’ve been a software engineer for more than two decades now, and I’ve had the opportunity to work across the industry. I joined Verta in November 2020 after a long tenure in the APM industry, where I worked inside some of the largest data pipelines in the world. This spanned the mobile market: monitoring tens of millions of iOS and Android applications and routing their application crash reports to the developers. I worked on custom dashboarding tools, and on operating and maintaining pipelines that track the states of millions of connected data agents. Eventually my work in APM culminated in the development of the backend of the product known as Lookout.
This is a system that analyzes thousands of streams of data and correlates deviations across them to highlight hotspots. The time I spent in APM gave me a unique and detailed view of the modern SaaS landscape, and how the developers and users of these systems engage with them. Now I work for Verta. We’re an end-to-end ML operations platform: we provide model delivery, operations, and monitoring. We serve production ML workloads at top technology, finance, and insurance companies. I joined Verta because my knowledge of APM and production SaaS systems could be applied to solve a new category of problems. Data scientists and machine learning specialists are facing many of the same issues enterprise software has faced in the past, but with new and unique challenges specific to the domain.
Today, I want to tell you a story about why I work on ML monitoring. To tell you that story, I’m going to have to go backwards first and explain what I even mean by APM. Then I can talk about what I’ve learned about ML monitoring that makes it so different, and why that problem was so interesting that I needed to pursue it.
Please provide feedback to our hosts. They greatly appreciate it.
I’ve used this term a bunch already, and I’m going to use it a bunch more. What is APM? I came to monitoring from the embedded and mobile world. I learned about APM because I was using New Relic’s mobile monitoring product. And when the team that maintained it was hiring, I had to join. The product is really cool. It gave developers so much power. It was the type of thing I wanted to work on. I learned that APM is short for application performance monitoring. It’s a class of monitoring focused on the performance characteristics of production software systems. APM data is metrics: a combination of a name, a value, some labels, and a timestamp. APM systems are designed to measure and store vast quantities of metrics, and then provide a simple, fast way to get them back.
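The metric shape described here can be sketched in a few lines of Python. The field names are illustrative, not any vendor’s actual schema:

```python
from dataclasses import dataclass, field
from time import time

# A minimal APM-style metric record: a name, a value, some labels,
# and a timestamp. Field names are illustrative only.
@dataclass
class Metric:
    name: str
    value: float
    labels: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time)

# Example: one response-time measurement tagged with its source.
m = Metric("http.response_time_ms", 42.5,
           {"service": "checkout", "host": "web-1"})
```

The labels are what make aggregation possible later: the backend can slice millions of these records by service, host, or any other tag.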
Now, the more I worked with metrics, the more I realized that raw metrics are actually rather limited in their usefulness. Their power comes when you aggregate them over time. And that inspired me to work on dashboarding, because making sense of this data requires the ability to take it and convert it into something you can visualize. I supported production systems, and I relied heavily on alerting to tell me when something was wrong. I learned that alerts are the most powerful part of any monitoring system, because they can inform the owner when action is required. I would also learn that my APM tools were invaluable to me. Without them, doing my job efficiently and effectively was pretty much impossible. The time I spent in APM taught me a lot about what is important. Monitoring data lets you know the system is alive and that it can and is properly processing requests.
Now, if the system is up and it’s handling requests, we can talk about performance. We need to know how long each request takes and how many the system can process in a unit of time. These are response time and throughput. We can also measure the error rate, or the percentage of requests that fail. When we combine availability, throughput, response time, and error rate, we get what’s known as the golden signals. This is a term that was coined by Google, and it’s now been adopted across the industry. When any of these signals deviates from the limits that I define, I need to be told. APM tools are at their most valuable when you can take action on their outcomes.
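As a rough sketch of how these signals are derived, assume a hypothetical window of request records, each with a duration and a success flag; the record format is made up for illustration:

```python
# Derive throughput, average response time, and error rate from a
# window of request records. Record fields are hypothetical.
def golden_signals(requests, window_seconds):
    if not requests:
        return {"throughput": 0.0, "response_time_ms": 0.0, "error_rate": 0.0}
    errors = sum(1 for r in requests if not r["ok"])
    total_ms = sum(r["duration_ms"] for r in requests)
    return {
        "throughput": len(requests) / window_seconds,  # requests per second
        "response_time_ms": total_ms / len(requests),  # average latency
        "error_rate": errors / len(requests),          # fraction that failed
    }

reqs = [{"duration_ms": 120, "ok": True},
        {"duration_ms": 80,  "ok": True},
        {"duration_ms": 500, "ok": False},
        {"duration_ms": 100, "ok": True}]
signals = golden_signals(reqs, window_seconds=60)
```

Availability, the fourth signal, is usually measured separately, by probing whether the system answers at all rather than by inspecting individual requests.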
I led a team of five engineers that supported a high-volume data processing pipeline for two years. The APM data produced by the system we supported allowed us to operate it. Without it, we would have been blind. Seeing when an error occurred in production and tracking it back to the service and function that caused it is something we did on a daily basis. Without the depth of information provided by the APM tools, this would have been impossible. My team supported mission-critical systems, so our APM data was connected directly to our critical incident response cycle. And as someone that’s been on call supporting production systems for 10 years, I can attest to the immediate value of an APM tool when things go sideways. When I get paged, I want to go straight to the dashboards and traces that will get me to the root cause, so I can fix it as fast as possible.
My time supporting production systems has also introduced me to another side: alert fatigue. Alert fatigue is a situation in which alerts fire with such frequency or non-actionability that they become meaningless. Actionability is an often overlooked aspect of alerting, but without it, an alert is actually just noise. See, non-actionable alerts make it impossible to distinguish which things are important, and therefore you start to ignore everything. In the same way, a perfectly actionable alert that happens too frequently blurs itself out. You don’t know whether you should even pay attention. Now, as someone that was responsible for entire teams that have suffered from the effects of alert fatigue, I can tell you that as human beings, we are just not designed for this. The effects are just bad. Every ignored alert could potentially be a severe incident. But if you honestly can’t tell, and it’s the 150th time you’ve been paged this week, you’re going to ignore it. That means reliability will suffer. You’re going to break SLAs, and that’s going to start to sink your team’s morale. And that’s where the worst possible outcome comes in: alert fatigue burns people out, and it burns them out incredibly fast. I’ve lost team members. I’ve seen people leave the industry, because the weight is just too much.
I moved out of the APM industry and into machine learning operations when I joined Verta. And that’s when I started to learn about model monitoring. At first, I thought, well, I understand monitoring. This sounds easy. But, the more I learned about model monitoring, the more I realized it was fundamentally different.
You see, I understood monitoring in terms of performance metrics, but when I learned more about model monitoring, I realized that the goal is different. It’s to provide assurance that the results of applying a model are consistent and reliable. You need to know when models are failing, and that sounds simple, but as it turns out, failing has multiple meanings: a model can fail to operate, or it can operate smoothly but produce incorrect results, or it can be sporadic and unpredictable. And when a model fails, as the owner, you’re now in a critical-response role. You need to quickly diagnose the root cause of the problem. This is something I understand. Production critical response is all about quickly identifying and fixing the cause of a problem.
Once again, I thought, okay, I get it. We detect model failure. That sounds easy enough. But as it turns out, knowing when a model fails is not necessarily a simple problem. Without a ground truth reference to measure against, failures can be difficult to detect. Determining whether a deviation has occurred involves complex statistical summaries that we collect over time. Those could be metrics, or distributions, or binary histograms, or any other statistical measurement that is important to the model. And they are subject-matter- and model-specific.
Determining root causes is inherently difficult in any system. It’s even more so in ML monitoring, since the model is part of a complex ecosystem. Understanding that jungle of dependencies and information requires a different perspective. We have to look globally at the entire model and pipeline to understand what’s happening.
Once an issue is identified, remediation can be very hard. These processes can be slow, error-prone, and highly complex. Now, as I thought more about it, I realized that in many cases, the best solutions would be self-healing systems, in which the system takes action automatically to remediate problems but is still smart enough to know when a human needs to be involved to make a decision. This is the approach I’ve taken in the past for large-scale, high-volume systems. By this point in the voyage, I was hooked. The domain of model monitoring is unique, so now I really needed to understand it in depth.
As I heard data scientists and ML specialists tell us their stories, I started thinking about what made ML monitoring so different. You see, APM metrics are predefined. Throughput is in requests per unit of time, error rate is a percentage, latency is measured as the average time per request. Typically, an APM agent will collect the data, normalize it into a predefined format, and then send it off to the vendor for storage and aggregation. This works wonderfully for systems with known characteristics. And while it’s extremely powerful, unfortunately, these units are too simplistic to track the quality of a model or a pipeline. Model monitoring adds two very new and unique dimensions to the problem: data quality and data drift. We need to know that the model data itself is reliable, and we need to know how it is changing over time.
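One simple way to quantify drift between a baseline distribution and a live one is the total variation distance over normalized histogram bins. This is only an illustrative sketch; the bin labels and counts below are made up, and real systems offer many alternative distance measures:

```python
# Total variation (L1) distance between two histograms: 0 means
# identical distributions, 1 means completely disjoint ones.
def normalize(hist):
    total = sum(hist.values())
    return {k: v / total for k, v in hist.items()}

def tv_distance(baseline, live):
    p, q = normalize(baseline), normalize(live)
    bins = set(p) | set(q)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in bins)

baseline = {"0": 500, "1": 500}   # balanced training distribution
live = {"0": 50, "1": 950}        # skewed production distribution
drift = tv_distance(baseline, live)
```

A monitoring system would compute this repeatedly as new samples arrive and compare the result against a tolerance chosen by the model owner.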
One of the things that drew me towards ML ops and model monitoring in the first place was the similarities I saw between these concepts and the APM world I was used to. See, deployed models handling predictions in a production environment are production systems. The golden signals of APM apply to them just as they apply to any other production system. The model has to be available. It must be capable of providing results in a timely manner, and it must do so without generating errors. We are human beings, not computers, so vast quantities of data are useless without a method of visualization that can translate them into comprehensible information. Now, when something goes wrong, you need to know immediately. Alerting is fundamental, since without it, taking appropriate action in a timely manner becomes impossible. These are all core APM concepts. I’m really comfortable with these.
As I learned more about model monitoring, I came to the realization that model monitoring is a superset of APM. Everything APM provides is now mandatory for the operation of a production model. However, the qualitative needs of model monitoring mean that APM is no longer a sufficient solution. Firstly, not all work is production work. Models have long, complex life cycles, and monitoring needs to be integrated from the beginning to produce the traceability required for issue detection. That means monitoring the development and training of the model, the production performance, the retraining, the canary deployments: every aspect of its life cycle. You see, APM tools are designed to tell you when something has gone wrong. Model monitoring, at its root, is about uncovering why a deviation has occurred and how to remediate it.
Monitoring data models and pipelines brings an entirely new set of concerns to the table. Evaluating model performance is not a simple comparison between two metric values. It is a controlled comparison of statistics and distributions across time. To provide value, it’s got to be repeatable and consistent. If I run the same experiment a million times, the monitoring output must be consistent across all those runs, or the value of those comparisons has been lost. Meaningful deviations from the norm must be capable of driving alerts that are immediately actionable. That could be an automatic self-remediation that rolls back a canary, or a Slack message that tells the owner what is wrong and connects them to the resources to address it as quickly as possible.
APM agents are designed to be functional out of the box. They know in advance how to get the data you’re most likely to need. This is what makes them so powerful. However, model monitoring does not fit into that box. This is an important difference to understand. In model monitoring, only you, the model owner, have the necessary information to know what needs to be monitored. This is because every model and pipeline is different and highly specialized. You designed, built, trained, and operate these models, and you understand them. As a result of the expertise that you have, you are the one who knows which metrics and distributions are valuable to measure. This information is part of the process of developing a model, and as the owner of the model, you’re the one who knows how to evaluate them. Ultimately, only you know what the expected outcomes are for your models, and you’re the one who knows how to determine correctness.
In the world of APM, comparison is predefined as part of the system. Evaluations come down to comparing two time windows and calculating a delta between the metric values. In model monitoring, comparison is not a singularly defined concept. It is inherently part of the model development process, and it evolves as the model evolves and matures. For instance, you may need to compare the current iteration of an experiment with the training or test data, or with a previous iteration. Once you’ve shipped a model, you might need to compare it to a golden dataset periodically for adherence. You might have a model that you’ve put into production as an endpoint serving predictions, and you want to compare it to a historical baseline. Or, perhaps the most interesting case, you might need to perform a comparison that is completely unique to your pipeline and your domain. These comparisons aren’t possible in a traditional APM world, because they require the monitoring system to have information that can only be provided by the model owner.
I realized that in model monitoring, determining when a change is meaningful and significant is no longer a simple comparison. Only the owner of the model knows the tolerances for deviation that are acceptable, and how to determine when a deviation is significant. Models and pipelines are living systems. They change with time, and only the owners of those systems can know when the tolerances are changing and how that should affect overall performance. And this all comes down to the most important aspect of the monitoring system: only the owners of the models and pipelines know when an alert should go off and what actions should be taken. This, unfortunately, is where APM systems show their downside. They make it incredibly easy to attach an alert to everything, and when you do, you automatically go overboard. And that’s when alert fatigue starts to set in. With model monitoring, we need to be more selective. Only the model owner can determine whether a condition should create an alert, and they’re the ones who have the knowledge of the model to know what that alert should be.
I joined Verta because I could see that the needs of ML monitoring were built on the background I had in APM, but would require a new set of concepts and methods to approach. As we researched the area and interviewed customers, I would learn that the needs of each ML customer are unique and distinct, but they’re all built on a foundation I understood. Our understanding of the problem domain inspired Verta to build a framework for model monitoring that could be used to solve this problem in a general-purpose way. Our framework expresses the domain in terms of four top-level concepts. A monitored entity represents the model or pipeline you want to monitor. Profilers are functions that can be executed against data frames to generate statistics. The collections of statistics about your data are called summaries. Each summary contains potentially many samples, each one a single measurement, all of the same type. Each sample can be tagged with metadata for cross-analysis and filtering. Periodically triggered alerters run against the generated data and determine if a deviation has occurred. This can potentially trigger an action.
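To make the relationships between these four concepts concrete, here is a hypothetical sketch. It is not the actual Verta client API; it only illustrates how a monitored entity, a profiler, and its summaries and samples fit together:

```python
# Hypothetical illustration of the concepts: a monitored entity holds
# summaries, each summary holds tagged samples, and a profiler is just
# a function that turns data into a statistic.
class MonitoredEntity:
    def __init__(self, name, workspace):
        self.name, self.workspace = name, workspace
        self.summaries = {}  # summary name -> list of samples

    def record(self, summary_name, value, tags=None):
        self.summaries.setdefault(summary_name, []).append(
            {"value": value, "tags": tags or {}})

def missing_ratio_profiler(column):
    """Profiler: fraction of missing (None) values in a column."""
    return sum(1 for v in column if v is None) / len(column)

entity = MonitoredEntity("insurance-cross-sell", workspace="demo")
entity.record("age.missing_ratio",
              missing_ratio_profiler([34, None, 51, 29]),
              tags={"stage": "production"})
```

An alerter, the fourth concept, would then periodically read `entity.summaries` and compare new samples against a reference.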
Our goal was to make this as simple as possible for you to integrate into your process. Whether batch processing, operating a live model, or managing a pipeline, the process for integration is the same. You use our client, you record your monitoring data, the samples are generated and stored directly into the Verta data store where they’re immediately available for visualization in our web UI. Your ground truth determination is applied, and our alerters evaluate incoming samples for deviation to trigger actions.
Now, I’d like to show you how you can use our platform to monitor an example model that we built. This example uses a logistic regression model to predict insurance cross-selling.
So, here we’ve set up our data model. We’re going to do cross-selling for insurance. We’re going to attempt to predict which customers should be cross-sold a plan based on what they already have. We’re setting up some data, defining some columns, and running some initial predictions. Most importantly, we’re going to set up a monitored entity. It’s got a name, in this case something I recognize, and I’m going to put it in a workspace. This is a shared environment that allows my teammates to collaborate with me.
Now, the next thing I need to do is define how I’m going to create my statistics for the system. My columns are of two types: I have numeric columns that are continuous, and discrete columns that are binary. I also care whether or not a column is missing. So I’m going to pull in three profilers, or statistics generators: one that looks for missing values, one that can do binary histograms, and one that can do continuous histograms. Then, for each of the column types that I have in my model, I’m going to generate a base sample. The summary that is created is of a known statistic that maps to that column’s type and has a name that we can find later.
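The two histogram profilers described here can be approximated with plain Python. The bin edges and example columns below are illustrative; the demo itself uses Verta’s built-in profilers rather than this code:

```python
from collections import Counter

def binary_histogram(column):
    """Count 0/1 values in a discrete binary column."""
    counts = Counter(column)
    return {0: counts.get(0, 0), 1: counts.get(1, 0)}

def continuous_histogram(column, edges):
    """Bucket a continuous column into bins given ascending edges."""
    counts = [0] * (len(edges) - 1)
    for v in column:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

binary_hist = binary_histogram([1, 0, 1, 1, 0])
cont_hist = continuous_histogram([5, 15, 25, 35], [0, 10, 20, 30, 40])
```

Storing histograms rather than raw rows is what keeps these summaries compact enough to collect continuously and compare over time.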
I’m also going to run a first set of data through to create the baselines that I’m going to use for comparison. Up next, I’m going to create some alerts. In this case, for each summary that I defined above, I’m going to create an alert that references a predefined sample that I’ve created, and if the difference between them is greater than 0.2, it’s going to create an alert for me. In this case, that alert is going to go to a Slack channel that I can find. Now, I’m going to run some test data through this. The first set is good data: it matches the expected distribution of the training set I put in. After that, I’m going to run this test again, but this time I’m going to restrict it to only customers whose previously-insured value was already one. I’m going to bias the results.
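The alert rule in this step, fire when the difference from a reference sample exceeds 0.2, reduces to a simple threshold check. This sketch leaves out the Slack delivery and uses scalar samples for illustration:

```python
# Threshold-based alerter: compare an incoming sample against a
# reference baseline and fire when the deviation exceeds a tolerance.
def evaluate_alert(reference, incoming, threshold=0.2):
    diff = abs(incoming - reference)
    if diff > threshold:
        return f"ALERT: deviation {diff:.2f} exceeds threshold {threshold}"
    return None  # within tolerance, no alert

quiet = evaluate_alert(0.50, 0.55)   # small deviation, stays quiet
fired = evaluate_alert(0.50, 0.95)   # large deviation, fires
```

In practice the "difference" would be a distributional distance between two histogram samples rather than a scalar delta, but the thresholding logic is the same.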
I want to take a look at what this looks like. Over here, you can see we’ve got 664 samples that we have collected. Underneath the chart you can see the name of each sample, its type, and the data that is included. Now, I want to take a look at this annual premium histogram. That sounds kind of interesting. Here, laid across the X axis is time, and vertically are the samples that were received. Each sample is a distribution. As we roll across them, we can see how that distribution is expanded on the right. And when we get to here, we see something interesting happen. We have a sample that doesn’t match what we expected. That’s interesting, and if we notice the labels over here, one of them is previously insured. Now that’s interesting, because I know that that’s the bias I introduced.
I want to look at active alerts. We have three alerts that have been activated by the system. The first one is on annual premium, which I was just looking at. There’s one on vehicle damage and, hey, one on previously insured. Now I’m curious, so I’m going to take a look at that.
This is the discrete, or binary, histogram, and we can see when we first start out that the results are well distributed and consistent from sample to sample, until we get to here. And suddenly we see something really interesting happening. All of the predictions for the previously insured data set are coming out as ones. There are no zeros, and that is definitely an aberration that we need to investigate. I’m at Verta because I see ML ops customers struggling to access the tools they need to efficiently do their jobs, and I wanted to be a part of helping solve that problem.
I remember my experiences in high-severity incidents, when my tools could mean the difference between an outage and a non-event. I believe that ML ops teams need to have access to tools that can provide the same capabilities for these new problems and domains that have emerged. I worked in APM long enough to learn another important lesson: building these tools is hard, which makes it risky and expensive, too risky and expensive for most organizations to do in house. However, I also know that these tools are mandatory to allow machine learning operations teams to continue doing their jobs and to tackle the problems to come. Learn more about Verta model monitoring on our website. We have a community you can join. We’d love to help you get the most out of your models. Thank you.
Over the last two decades Cory has worked across the spectrum of software engineering; from embedded systems to massive high volume Kafka pipelines. Building systems that directly address the needs o...