Real-world Strategies for Debugging Machine Learning Systems

May 27, 2021 04:25 PM (PT)


You used cross-validation, early stopping, grid search, monotonicity constraints, and regularization to train a generalizable, interpretable, and stable machine learning (ML) model. Its fit statistics look just fine on out-of-time test data, and better than the linear model it’s replacing. You selected your probability cutoff based on business goals and you even containerized your model to create a real-time scoring engine for your pals in information technology (IT). Time to deploy?

Not so fast. Current best practices for ML model training and assessment can be insufficient for high-stakes, real-world systems. Much like other complex IT systems, ML models must be debugged for logical or run-time errors and security vulnerabilities. Recent, high-profile failures have made it clear that ML models must also be debugged for disparate impact and other types of discrimination.

This presentation introduces model debugging, an emergent discipline focused on finding and fixing errors in the internal mechanisms and outputs of ML models. Model debugging attempts to test ML models like code (because they are code). It enhances trust in ML directly by increasing accuracy in new or holdout data, by decreasing or identifying hackable attack surfaces, or by decreasing discrimination. As a side-effect, model debugging should also increase the understanding and interpretability of model mechanisms and predictions.

In this session watch:
Patrick Hall, Principal Scientist | Visiting Faculty, bnh.ai | The George Washington University

 

Transcript

Patrick: Okay. Hi, this is Patrick. Thrilled to be here, at least virtually, and today I’ll be speaking about real-world strategies for model debugging. And we’ll start with a little introduction of what model debugging is and then get into how to do it, maybe take a couple little detours along the way. All right, so, like many of you, I took software classes as an undergrad and as a graduate student and spent a lot of time learning about software engineering. And one thing that I learned during all of that, and in my subsequent professional experience, is that all software has bugs. If software is untested, that doesn’t mean it works, it just means that we haven’t found a bug yet. Moreover, I’m not personally aware of any machine learning implementation that is not software.
I’m not saying that’s impossible, or maybe you and I just have different definitions of software, but basically machine learning is software, and all software has bugs. And as we’ll see, those bugs are sometimes turning into big problems in the real world, and we want to squash them. Okay, so let’s get into this. What is model debugging? Like many things in machine learning and AI, there are lots of words for this. I hear AI assurance or machine learning assurance, I hear model diagnostics. All of those are synonyms for what I’m talking about today. And if you prefer one of those terms, that’s absolutely fine. I’ll just be using model debugging for my purposes today. So it’s a new idea in some ways. Nothing under the sun is really new, but model debugging is new in some ways. So let’s say it’s an emerging discipline that’s focused on finding and fixing errors in machine learning pipelines.
We want to port over a lot of the best practices that we learned from our friends in software engineering to test our machine learning models whenever possible. I definitely consider that part of model debugging. And then, as I hinted at, it’s new, but how new is it? Because there is this whole highly developed field of regression diagnostics that people have been working with for decades. And so we definitely want to take those lessons and port them to machine learning as well. And the core goal of model debugging is to promote trust in models. I want to say that this model is fully tested from a software perspective and from a mathematics or machine learning perspective. And when I do that, I should promote trust within my own organization, for my customers, for the general public. And another great benefit is that when I really start digging in and testing my models I tend to learn more about them, and it increases the interpretability of my model as well.
So just to summarize, we want to promote trust directly with model debugging. And there’s a nice side effect: when I get in there and start digging around and fixing things, I learn more about my models, and so interpretability tends to increase as well. I’ve tried to pepper little links through these slides. So down here, this is one of the first, maybe the first, academic workshops on machine learning model debugging, at ICLR 2019. And so if you’d like a different perspective, or perhaps a perspective that’s more focused on deep learning, I’d suggest checking out that link. A lot of the examples I’m going to be using are for more structured data and tree-based models. Okay, so why debug models? We’re already busy just training the models. I don’t have time, I don’t have the resources. What’s the motivation here? I thought my machine learning model just worked. Well, let’s talk about some of the motivation and, unfortunately, the motivation is becoming pretty public and dramatic. So let’s start here with Tay. And Tay is a chat bot released in 2016 by Microsoft Research, who was nearly immediately …
I’m going to use the word attack. I think it is technically correct. It underwent a data poisoning attack from Twitter users that caused what I would call an algorithmic discrimination incident. So within about 16 hours of Tay being deployed as a chat bot on Twitter, users understood that they could say very nasty things to it and it would say them back. And in doing so they turned this into a neo-Nazi pornographer service as opposed to a friendly young person chat bot impersonator. And so this is maybe one of the first high-profile AI incidents, and I think it’s important to call out because I would call this a hack, a security vulnerability, that very quickly, over the course of just hours, spiraled into an algorithmic discrimination issue. And so when we deploy machine learning and AI systems into the real world and broaden the context in which they’re functioning, more and different things can go wrong than we might be expecting from the in-lab design perspective, and I think Tay is a great example of that.
Okay, we’re going to go through a couple more of these. So, same timeframe, a few years ago, the now notorious COMPAS risk assessment instrument, used to help make decisions about whether people should be paroled or whether people should be kept in jail before their trial. It was found that an inmate in New York state actually missed parole because inputs into this black box system were wrong and his risk score was too high. And he was denied parole because of this. And so I would say that this is an issue of transparency and accountability into the functioning of black box algorithmic systems, and it also shows us the importance of user override and appeal for these systems. So a very good debugging tip right off the bat is to make sure that, if your machine learning system is interacting with people, those people have the ability to appeal the inevitable wrong decisions that your system will make. And especially if you’re working in a sensitive area, like criminal justice, or employment, or lending, it’s a big deal.
Another famous one was Woz noting this incident that happened with the Apple Card, in which some women received much smaller credit limits than their partners. And I think that this is still being investigated. I’m not saying that a crime was committed here, but it was definitely a reputational hit for these two very well resourced companies. Certainly Apple and Goldman Sachs have plenty of good data scientists and plenty of good attorneys, but this still happened to them. It could happen to any of us. Okay, then the very famous incident where the Uber self-driving vehicle struck a pedestrian in Arizona and killed that person. What I like to note here is that the NTSB, the government agency that tracks these kinds of incidents, declared that Uber’s self-driving systems had no notion of a jaywalking pedestrian. So again, a model debugging tip: if you can foresee a problem, like a jaywalking pedestrian for a self-driving car, you want to make sure that you have strong mitigants in your system to deal with those foreseeable problems.
We’re doing a greatest hits here. You may have heard some of these, maybe you haven’t, but in this one a large insurance company, their heart was in the right place. They wanted to reach out to some of their most vulnerable patients and give them extra health care. Unfortunately, they made an experimental design mistake that led to actually hindering care for some of those most vulnerable people, and potentially at a very large scale. So, really, the list just goes on and on here. And we can see repeated failures. Microsoft shows up here, and here. We want to learn from those failures and not repeat them. We’ll see another one in just a second. I believe that these six all happened in the same week last year. So, really, what I’m trying to motivate here is that we are leaving bugs in our code that are causing major problems, and we need to fix them. So here’s our last AI incident.
This is Lee-Luda. Five years after Tay, nearly the exact same kind of incident. Okay, so not only are we leaving bugs in our code, we are leaving bugs in our code that cause incidents that have already happened. And if we just looked at other failed designs, we would know about them. So, another model debugging tip: check your design against past failed designs, as is done in other mission-critical commercial technologies, like aviation. So just real quick on Lee-Luda before we move on, a South Korean chat bot that was caught making discriminatory comments. Very, very similar incident to Tay, which occurred five years previously. Okay, so we don’t want to repeat the same incident. I won’t beat a dead horse here, but I think there are roughly 1,200, maybe more, of these reports in the Partnership on AI’s AI Incident Database. So we’ll see some more data about this. These incidents are increasing.
Okay, so the bugs in our machine learning pipelines are causing problems in the real world, and they’re causing repeated problems, which, aside from being harmful and expensive, is also just embarrassing. Okay, so, like I said, we’re seeing more and more of these. I seriously doubt that this jump between 2019 and 2020 reflects the actual number of incidents, but I do think what it reflects is the attention on incidents. So regulators are starting to pay attention, and we’ll see some of that. Journalists are really starting to pay attention. And whether you’ve committed a crime or not, that kind of attention on your AI systems is something in general you’d probably like to avoid. All right, so we did a little breakdown at bnh.ai on these common failure modes. So the thing to watch out for the most is algorithmic discrimination. Sad, but true. So most of the failures that are happening these days are related to algorithmic discrimination.
I think it’s debatable whether testing for algorithmic discrimination and remediating algorithmic discrimination should really fall under model debugging. I think technically it does. Obviously it’s just such a serious problem and such a wide and deep discipline on its own. I’m not sure that it fits nicely into the model debugging topic, but I’ll touch on that subject a little and go ahead and give the caveat that there is a lot more that we all need to do besides what’s in this presentation. So most common kind of failure, sadly, algorithmic discrimination. Followed by those lack of transparency or accountability type failures that we discussed with the gentleman who was kept in prison for too long. We’ve seen a lot of silly performance errors, and we’ve seen some dangerous ones that actually hurt people. The Uber incident probably being the most noteworthy, but definitely others as well. Another common type of incident is using data that is not allowed by privacy policies or privacy laws. That’s another wash, rinse, and repeat.
Data scientist gets data they’re not supposed to have their hands on, puts it in a machine learning system, goes to market, violation ensues. People are doing that one all the time. The two other kinds of incidents that we see rising to the top are unauthorized decisions. So machine learning or AI systems operating out in the real world have a very broad operating environment and they may do things that you don’t expect. And so I’m a big proponent of limiting them manually, saying these are things that you cannot do, machine learning system. You cannot answer this question. And I think that’s a really good idea because, if you don’t watch them, they’ll end up making decisions that maybe you didn’t even know that they were able to make or make them in a context that you weren’t aware of. And that could have reputational and legal consequences. There’s also some security breaches related to machine learning and I expect that we’ll see that rise in the future.
For now, that’s the smallest top category. All right, and, again, as I hinted, there are serious legal and regulatory concerns for some people today, and for many of us on the horizon. So the European Union just proposed a major, almost GDPR-level AI regulation. And so we can expect that if that goes into effect we will feel it in the US. And just like GDPR changed how we operate with data in the US, this regulation could change the way we work with AI in the US. And among many, many, many other things it does call out quality control and quality assurance specifically. Okay, so getting ready for future regulations is a big reason to do model debugging. The US FTC, the Federal Trade Commission, the organization that fined Facebook $5 billion in 2019, is very interested, too. And they want to make sure that your models are validated and revalidated. They’ve been saber-rattling on AI enforcement since the beginning of last year and have put out two very direct guidance blogs about this.
And if you’re working anywhere in the US economy on AI, you should at least be aware of this guidance. And there’s a link to one of them there on the screen. Another way that we can go wrong legally … And I’m not a lawyer, should have said that from the beginning. Not a lawyer. Another thing that we have to watch out for that’s just common sense is there are safety expectations. There are product liability laws and standards around negligence that just have to be met. You can’t make unsafe products, and some of the things that we’ve seen with AI in recent years and months may, I would say, rise to that level. And there could be basic product liability and negligence concerns to be aware of as well. So why debug models? Some of us work in regulated industries right now and already have to obey regulations with our ML and AI systems. I think for everybody else, that’s coming soon.
All right, so now we’re going to get into more motivations for model debugging, but also how to do it. And I’m going to set up a straw man, an example for us to tear down: something that looks good, but actually isn’t that good. And the basic lesson here is that the way I was taught to assess machine learning models, very likely the way you were taught to assess machine learning models, was for research papers. It just doesn’t work in the real world. And I know that may sound like a dramatic statement, but hopefully I’ll have you convinced by the time we get through this straw man exercise. We’re going to set up a really nice model and then tear it down pretty viciously. So I teach at George Washington University. This is a model that I would give even graduate students an A on. It’s the kind of model that in my professional work I would be thrilled to see a client deploying. So let’s go through the good aspects of this model.
It’s constrained, so it uses monotonicity constraints. As an input value goes up for certain variables, the model output has to go up. That’s a positive monotonic constraint. For other variables, as the input value goes up, the model output has to go down. That’s a negative monotonic constraint, and those constraints are picked from domain experience. That’s a really good way to get domain knowledge into your model, and it’s easy to do in XGBoost, so I like that. The model was selected using validation-based early stopping in XGBoost with a random grid search over hundreds of models. I know we can do better, but at least we did a grid search. It’s seemingly well regularized, so it should be fairly stable: gradient boosting techniques incorporated, row and column sampling, and then there’s an explicit way to specify L1 and L2 penalties in XGBoost.
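For concreteness, here is a minimal sketch of that kind of constrained, early-stopped XGBoost setup. The file path, feature list, constraint signs, and parameter values are all hypothetical, and exactly where the early stopping arguments go varies across XGBoost versions.

```python
# Hedged sketch only: hypothetical path, feature names, and settings.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

df = pd.read_csv("UCI_Credit_Card.csv")              # Kaggle credit card default data (assumed path)
y = df["default.payment.next.month"]                 # target column name varies by file version
features = ["PAY_0", "PAY_2", "PAY_3", "BILL_AMT1", "PAY_AMT1", "LIMIT_BAL"]
X = df[features]

X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
X_valid, X_test, y_valid, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold)

# Monotonicity constraints from domain knowledge:
# +1 = output must rise with the feature, -1 = must fall, 0 = unconstrained.
mono = (1, 1, 1, 0, -1, -1)

model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    subsample=0.8,               # row sampling
    colsample_bytree=0.8,        # column sampling
    reg_alpha=0.01,              # L1 penalty
    reg_lambda=1.0,              # L2 penalty
    monotone_constraints=mono,
    eval_metric="auc",
    early_stopping_rounds=50,    # validation-based early stopping (recent XGBoost versions)
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
```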
There was no evidence of over- or underfitting. The test accuracy was right below the validation accuracy, which was right below the train accuracy. Looked great. We actually compared it to a linear model, and it was more accurate than the linear model. You should always check; that’s not always the case. And we at least put some thought into selecting the decision threshold. We maximized F1 to select the decision threshold. But if we stop and think for a second, this just doesn’t tell us very much that we need to know for the real world. For one, this model actually isn’t even necessary. Okay, so we need a little bit more introduction to the case here, but the case, or the application, is just really the credit card data from Kaggle. You’re probably familiar with it: very small, very simple, pretty unrealistic, publicly available data where we’re trying to predict if someone will make their next credit card payment. So if I do some digging in my model, look at residuals, which is a great way to do model debugging, and look at variable importance, I see a striking pattern. So pay zero, this variable, is someone’s most recent repayment status.
It looks to be about four times more important than the rest of the variables, and that’s never a great sign. That’s like when we have a giant regression coefficient in our regression model. It puts all the weight of the decision making on one variable, which can be a security issue, too. And then if I look at the residuals, I can see some confirmation of what I’m suspicious of here. So the pink is for people who do default, the residuals for people who do default. These are the good values of pay zero, things like paying your bill on time, paying your entire statement balance, not using your credit card. So that’s my most recent repayment status. And when someone like that goes on to default, the model is essentially shocked. I get these giant … I shouldn’t say giant, but large numeric residuals in that case. And then for these bad values, or late repayment statuses, of my most recent repayment, I get large residuals again when someone does not default. So, again, my model is just shocked.
If you don’t obey what your most recent repayment status says you’re going to do, then the model just issues these large numeric residuals. So this model, with something like tens of thousands of if-then rules in it, is really just a glorified single business rule: if pay zero is greater than one, then we say the person defaults. That’s really what this model says. Instead of going with that simple business rule, which is incredibly transparent, I’m now releasing tens of thousands of rules out into the world that I don’t fully understand, that could have potential security vulnerabilities, and that could be hiding sociological bias. Turns out it is. Some of these numbers, especially the adverse impact ratios, we want to see above 0.8, and we’re just way off across some marital statuses. There’s other demographic information in this data set that we could have tested across, too. This was just the most egregious example of some very suspicious discrimination behavior across marital statuses.
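As a rough illustration, here is a minimal sketch of the adverse impact ratio calculation behind numbers like these, reusing the hypothetical model and data from the earlier training sketch. The probability cutoff, the choice of reference group, and the idea that a predicted non-default counts as the favorable outcome are all simplifying assumptions.

```python
# Hedged sketch: adverse impact ratio (AIR) across a demographic column.
# Assumes `model`, `X_test`, and `df` from the earlier training sketch.
import pandas as pd

cutoff = 0.26                                              # hypothetical probability cutoff
prob = model.predict_proba(X_test)[:, 1]
favorable = pd.Series(prob < cutoff, index=X_test.index)   # predicted non-default = "approved"

groups = df.loc[X_test.index, "MARRIAGE"]                  # marital status column in the raw data
rates = favorable.groupby(groups).mean()                   # acceptance rate per marital status
reference = rates.idxmax()                                 # simplification: most-favored group as reference

air = rates / rates[reference]
print(air.round(3))                                        # four-fifths rule of thumb: want values >= 0.8
```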
And then finally, as I keep hinting at, there are lots of security issues to at least be aware of. I don’t think these attacks on machine learning systems are the most common way for a company to get hacked today. I do think they’re going to be more common, and we’re already starting to see them happen. So there are lots and lots of different hacks on machine learning models that have been published. This is six out of maybe two dozen that I’m aware of, and it’s much harder to hack a business rule than it is to hack tens of thousands of black box rules that aren’t really understood by their users. So I just want to pause here for a second and restate all the good stuff I said about this model, all those fit statistics. Quite simply, test error doesn’t tell you whether the model makes common sense or not. Test error doesn’t really tell you, unless you break it down by demographic group, whether you’re discriminating or not. Test error doesn’t really tell you about security vulnerabilities.
So the way we’ve learned to assess machine learning models for research papers in school doesn’t tell us about things like common sense performance problems, potential algorithmic discrimination, security, or data privacy. What does AUC tell us about data privacy violations? Practically nothing. So when we deploy machine learning models into the real world, we have to think about a lot more. And that’s what we’re going to do for the rest of this presentation. All right, so first I divide model debugging into the basic software QA, quality assurance, steps and basic IT governance. And then I try to think about more specific techniques that can be applied to investigate response functions or decision boundaries and really probe the logic of our complex machine learning systems. So we’re going to talk about the IT aspect first and then jump into some of these more specialized techniques for machine learning models.
I’ll say right now, the IT stuff’s pretty boring, but you just have to do it. All right, so most big companies who make or use software have a decent idea of how to control risk around it. And the question becomes: why are data scientists, or machine learning engineers, or machine learning projects given a pass on basic IT governance and software QA? I’m not saying that happens everywhere. I will say that it happens in a lot of places I’ve seen, though. So on the IT governance side, we know about things like incident response plans, we know how to manage development processes, we know about things like code reviews and pair programming. We have, for the most part, somewhat decent security and privacy policies in many organizations. And so the key is to take that stuff and make sure that it’s applied to your AI and ML development efforts. So that’s part of the equation. Another part of the equation is what we do specifically for governance of ML and AI development and for systems once they’re deployed.
There’s a ton of known information about this as well, and it lives under the subject heading of model risk management. Model risk management has been around for many years, but it was codified, in the fallout from the 2008 financial crisis, in this brilliant paper from the Federal Reserve, the Supervisory Guidance on Model Risk Management, often known as SR 11-7. And so this brilliant paper on machine learning, or predictive modeling, risk management puts forward really smart ideas about executive oversight. Having a boss who’s in charge of making the machine learning right is one of the best ways to make the machine learning right: an empowered executive with a staff and a budget, who gets big bonuses when the machine learning does the right thing, and faces the consequences, potentially up to being fired, when the machine learning does the wrong thing. That changes the tone of how organizations do machine learning. And in banking they have that job, and it’s called the chief model risk officer.
And I think that having that kind of executive is one of the smartest controls you can have, if you’re going to be serious about machine learning and AI. Documentation standards, models have to be documented like software. And even more so, who do you call when something goes wrong? What are the data inputs? What are the data outputs? Are they going into other systems? Are they coming from other systems? All of these things need to be documented and, moreover, in a standard way so that people across different organizations within the same company can understand it. There’s this idea of multiple lines of defense, where we have multiple groups of people whose job it is to get the machine learning model right and they check each other. Common sense helps a lot. And then just this notion of inventories of models, knowing how many models you have, knowing where they’re deployed, how they’re deployed, and monitoring them, making sure that they’re monitored.
Are their inputs being monitored for drift and anomalies? Are their outputs being monitored for drift and anomalies? These are all things that fall under model risk management. And while most companies probably don’t want to undertake full model risk management, any of these would be a great lesson to learn and bring into your organization if you’re not doing something like that today. On the software testing side I’m not going to spend a ton of time. Some of you are probably better at this than I am, but we know how to test software, and we need to make sure that testing is applied to machine learning. And then I’ll call out chaos testing as being really, really appropriate for machine learning, because we tend to deploy machine learning and AI systems into these chat bot contexts where almost anything can happen in terms of a conversation, and it’s hard to put guardrails on it. And so testing that system under intentional failures and adversarial conditions is really, really important to harden it and make sure it actually works.
Other testing techniques that are great for machine learning, techniques that I would call model debugging specifically for machine learning, are things like reproducible benchmarks. So every day, or every week, start from a reproducible benchmark and take measurable steps away from that reproducible benchmark. Random attacks are another idea. They’re a lot like chaos testing, but more specific to machine learning. In a random attack we expose the machine learning system to just tons and tons of random data, see what errors happen, and start tracking them and squashing them. And that’s another really good way to harden your system against unexpected data inputs, which happen often out there in the wild.
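To make the random attack idea concrete, here is a minimal sketch under the same hypothetical setup as the earlier training snippet; a real random attack would cover far more data types, encodings, and edge cases.

```python
# Hedged sketch: expose the scoring pipeline to random and pathological inputs
# and record anything that raises an error or produces an out-of-range score.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
problems = []
for _ in range(1000):
    row = pd.DataFrame([{
        c: rng.choice([rng.normal(0, 1e6),             # huge magnitudes
                       float(rng.integers(-10, 10)),   # plausible-ish values
                       np.nan, np.inf, -np.inf, 0.0])
        for c in X_train.columns
    }])
    try:
        p = float(model.predict_proba(row)[0, 1])
        if np.isnan(p) or not (0.0 <= p <= 1.0):
            problems.append(("bad score", row, p))
    except Exception as e:                             # run-time failures are exactly what we're hunting
        problems.append(("exception", row, repr(e)))

print(f"{len(problems)} problems found in 1,000 random rows")
```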
Okay, so, like I said, I think it’s just due to hype. Everybody’s excited, and this probably happens with almost every single new technology, but many organizations, organizations that do a great job with software development, are not enforcing those same standards on data science and ML teams, and I’m not sure why. Oftentimes, and I’ve seen this with my own eyes, data scientists and machine learning engineers, for the sake of go fast and break things, are allowed to operate in clear violation of security and privacy policies and even laws. And that just doesn’t make any sense. Yes, you might beat your competitor to market by a few months, but when you get slammed by an FTC enforcement action or something related to GDPR, you’re going to be set back more than a couple of months. So I know there’s this hype, there’s this notion of go fast and break things, but I think it’s time that we slow down a little bit and act like adults. Just my personal opinion. Another striking thing that I’ve noticed is that many organizations have incident response plans for all of their mission-critical computing except machine learning systems. What? I mean, just have an incident response plan for your machine learning systems, too. Of course, very few nonregulated organizations are practicing model risk management.
I don’t think it’s a great idea for every company out there to completely implement full-on model risk management, but I think selecting from the buffet of great risk controls in model risk management is a good idea for every company. And I think if we’re just serious with ourselves for a second and take a step back, although machine learning has been around since the late 1950s, other people might argue earlier, and it’s been deployed extensively in certain verticals, eCommerce, banking, finance, in terms of it being used across the entire economy, we’re really just in the wild west. And hopefully I’m starting to convince you. I showed you all those incidents in the beginning. It’s time to get things under control so that the discipline can mature, the public can have more trust in it, and we can make better models. Google actually puts out some great materials on model debugging from a more basic IT standpoint, and I’d urge you to check those out if you’re finding this interesting so far.
That’s what this link down at the bottom is. All right, how do we test our machine learning models? That’s probably what you tuned in to see, not me rambling about my thoughts on model risk management. So let’s talk about how to debug these models. All right, so I think one major way that we debug models is what I call sensitivity analysis. It’s a very basic idea: finding or simulating data that is adversarial, or interesting, or random, and just seeing how your machine learning model behaves. As this old diagram, which I really like, reminds us, machine learning models can do basically anything on data outside their training domain. And unless you’re testing that with sensitivity analysis, you’re just not going to be aware of it. And so here we see a basically linear fit by some simple neural network, which looks right in the domain of the data. And then we see that outside of the domain of the training data this overly complex machine learning system has twisted itself into knots to fit the data, and it’s doing almost anything outside the range of the data.
Don’t get caught in that situation. Test with sensitivity analysis. There are structured ways to do sensitivity analysis. I’m going to talk about some of the most basic kinds, but you’ll be able to Google and find more advanced ones, I’m sure, if this is seeming easy to you. So one way I like to do sensitivity analysis is partial dependence and ICE, Individual Conditional Expectation. And maybe that doesn’t traditionally fall under sensitivity analysis, but I would say we’re perturbing the data, we’re looking at what happens, and, more so, we’re doing it in a structured way. I think we can learn a lot by doing that. So what we see here is, first, our data. It’s always good to look at the data. And the blue is people who did not default. They are more frequently distributed over these better values of pay zero, and we have fewer people who did default, in pink, and they are more frequently distributed over higher, or worse, values of pay zero.
What did the model do? That’s what partial dependence and ICE tell us. Partial dependence is this gray line. That’s the estimated average behavior of the model, and then the ICE curves are simulations for individuals. What we look for here is many things, but the first thing I look for is whether ICE follows along with the partial dependence. And in this case, it does. And what that tells us is that the partial dependence is a trustworthy representation of our individual model behavior. The average representation is representative of the individual behavior for this variable. If we didn’t see that, if we saw ICE going in different directions than partial dependence, that would tell us that there was an important interaction in our model that would drive individual behavior away from average behavior. We don’t see that here, so that’s neither good nor bad, just a sanity check.
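For reference, here is a minimal sketch of how one might draw this kind of partial dependence and ICE plot with scikit-learn's inspection module, again reusing the hypothetical model and data from the earlier snippet (packages like interpret offer similar plots).

```python
# Hedged sketch: partial dependence (average behavior) overlaid with ICE curves
# (individual simulations) for the most recent repayment status.
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    model,
    X_test,
    features=["PAY_0"],
    kind="both",       # "average" = partial dependence, "individual" = ICE, "both" = overlay
    subsample=50,      # draw a manageable number of ICE lines
    random_state=0,
)
plt.title("Partial dependence and ICE for PAY_0 (hypothetical column name)")
plt.show()
```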
Coming back to the plot, we also see that the model learns a lot where there’s data and then flat-lines out here where there is not data. And so I think what happened here is we just got lucky with the monotonicity constraints. It won’t always work out this way, but the monotonicity constraints caused the model to take the behavior from where there was data and just push it out to the highest possible values. In this case, it worked out, but there’s no data here. And that’s going to be a problem over, and over, and over again for this model, one that we would not have seen just looking at AUC or confusion matrices. And I’m not sure I’d even have machine learning models making these decisions out here. Maybe that’s more appropriate for human case workers, or business rules, or something like that. What I see when we don’t put in a monotonicity constraint is that these lines drift back down towards the average. So we see really weird things, like someone who’s seven months late on their most recent payment having a lower probability of default than someone who’s one month late.
And that just doesn’t make any sense, and in general that’s the kind of thing we want to watch out for and why we would use monotonicity constraints. Another interesting thing here is that missing is the best possible value for someone’s most recent credit card payment. That makes no sense. You probably want to change that. And, again, it’s a security problem. If I can somehow hack in a missing value for pay zero in your machine learning pipeline, I’m getting a credit card. And, again, just being able to tweak any value of pay zero, because the model’s so dependent on it, is a security problem. If someone can change someone else’s value, or their own value, of pay zero to a value that they want, they have huge control over whether they’re going to get this card or this credit offer or not. I think that’s all we wanted to cover here. Before we move on, let’s remember these ICE lines, because I’m going to use them on the next slide: we have one with a big swing.
All right, I’m going to pick one of these rows that has a big swing in its ICE curve and I’m going to perturb that even more. And then I’m going to get this more full, but certainly not full, picture of how my machine learning model behaves. When I have all these dimensions and all these possible outcomes in the real world, it’s nearly impossible to know everything that my machine learning model is going to do, but this gives me a better picture than I would have had just by looking at AUC. So one nice thing here is I can confirm that my monotonicity constraints held across all these different thousands and thousands and thousands of perturbations. So what’s happening here is I’m taking that row with the ICE line with the big swing that I know has the ability to swing my models predictions and then just perturbing it thousands and thousands of times, running it back through the model and making these response surfaces.
So we find some interesting things in these response surfaces that we never would have found just from looking at AUC. So we confirm monotonicity, check. Great, at least empirically, but we find these other logical bugs. And so this model might not be able to handle prepayment. What this tells me is that once someone gets up above that two-months-late threshold, it doesn’t matter if their most recent payment was above a million dollars. I’m still going to say they’re going to default. And that’s not necessarily wrong, but it’s at least something you want to be aware of. If you had a very high net worth client go on vacation and prepay a million dollars, you could still give them a default decision two months later if they didn’t pay their bill while they were on vacation, or something like that. Not necessarily wrong, just something you’d want to be aware of. And, again, this behavior, where we see an extremely suspicious spike for pay amount one and pay amount two, the most recent and second most recent repayment amounts, is also not necessarily wrong.
It’s just very conspicuous, spiky behavior that really stands out. If I were going to try to do an adversarial attack on this model, I’d be really interested in this little spike, if I could find it. I can shoot my own, or someone else’s, probability of default up really high, really quickly, by setting their first and second payment amounts very low. That does make sense, it’s just that the shape of this response function is suspicious and surprising to me. So, again, we learned a lot here that we wouldn’t have learned just by looking at AUC. All right, so there are other ways to do sensitivity analysis. There’s a great package, the interpret package, that is maintained by Microsoft Research. It’s an open source package with really interesting, great sensitivity analysis functions, and they mostly come from this other project called [SALIV]. So in addition to what I’ve shown here, and the random attacks we’ve talked about, I’d have you look at [SALIV] and interpret as well for doing sensitivity analysis.
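Before leaving sensitivity analysis, here is a minimal sketch of the single-row perturbation behind those response surfaces: take one interesting row, sweep a couple of its features over a grid, re-score, and check the monotonic behavior empirically. The row choice, the features, and the grid ranges are all hypothetical.

```python
# Hedged sketch: perturb one row across a grid of two features and inspect
# the resulting response surface and empirical monotonicity.
import numpy as np
import pandas as pd

base = X_test.iloc[[0]].copy()                     # hypothetical "interesting" row (big ICE swing)
pay0_grid = np.arange(-2, 9)                       # repayment statuses to sweep
amt_grid = np.linspace(0, 1_000_000, 50)           # most recent payment amounts to sweep

rows = []
for p0 in pay0_grid:
    for amt in amt_grid:
        r = base.copy()
        r["PAY_0"] = p0
        r["PAY_AMT1"] = amt
        rows.append(r)
grid = pd.concat(rows, ignore_index=True)
grid["p_default"] = model.predict_proba(grid[X_test.columns])[:, 1]

# Empirical monotonicity check: for a fixed payment amount, the predicted
# probability should never decrease as PAY_0 (lateness) increases.
surface = grid.pivot_table(index="PAY_AMT1", columns="PAY_0", values="p_default")
violations = int((surface.diff(axis=1) < -1e-6).sum().sum())
print(f"monotonicity violations across the grid: {violations}")
```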
The other big branch of practical model debugging is residual analysis. And there’s a great paper called Residual Surrealism that provides a ton of different simulated data sets that allow you to put messages in your model residuals. Those are linear models, but I think it helps point out the importance of inspecting residuals, which we can do for machine learning models, too. We did this for decades with linear models. Why aren’t we doing it for machine learning models? It’s not always as simple, it’s not always as informative, but it’s still useful. So we want to learn from our mistakes with residual analysis. Residual analysis is the mathematical study of those modeling mistakes. We want to learn about those mistakes so that we can try to fix them before they happen. And so the important tests here are just to look at residuals feature by feature and level by level. We can do the overall residual versus predicted plot, great, and there are some fun ones in this paper, but it’s oftentimes more informative to break residuals down by feature and by level.
You can learn a lot more doing that and that’s what we saw in one of those first slides where I made the claim that this model is just a glorified business rule. We want to look at our error over segments, including demographic segments. And then there’s all kinds of new things, like Shapley contributions to log loss that we can look at. And we can even model our residuals and learn that way. So sensitivity analysis, one big bucket of model debugging, residual analysis, another big bucket. All right, so here I’m looking at the error across segments. We’re looking at the accuracy across segments. So instead of just looking at AUC over the entire data set, I’m breaking it down and looking at it across important segments, like pay zero values and like sex or gender. And we can see just a massive problem here. This model has no idea what it’s doing once the data gets sparse. We saw that in the partial dependence, we’re seeing it here now as we’re breaking down all these different error measures across values of pay zero.
There is some good news for sex or gender, which is that the model performance is pretty equal across men and women. It wasn’t particularly equal across marital statuses, but it is pretty equal across men and women. This is just the very tip of the iceberg for what you would need to do to assure nondiscrimination, but it’s not a bad thing to see. The bottom picture looks a lot better than the top picture. So, again, what I’m suggesting is to analyze error across important segments. When we see high variance in error across important segments, that can be a clue to all kinds of problems, one being underspecification, which we’ll talk about just a little bit at the end.
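Here is a minimal sketch of that kind of segmented error analysis, under the same hypothetical setup as before; in practice you would look at more metrics and more segmenting columns, including the demographic ones.

```python
# Hedged sketch: break accuracy-type metrics down by segment instead of
# reporting one global AUC.
import pandas as pd
from sklearn.metrics import roc_auc_score, log_loss

scored = X_test.copy()
scored["y"] = y_test.values
scored["p"] = model.predict_proba(X_test)[:, 1]
scored["SEX"] = df.loc[X_test.index, "SEX"]        # demographic column from the raw data

def segment_metrics(frame, by):
    out = []
    for level, g in frame.groupby(by):
        row = {by: level, "n": len(g),
               "logloss": log_loss(g["y"], g["p"], labels=[0, 1])}
        # AUC is undefined for single-class segments, e.g. sparse PAY_0 levels
        row["auc"] = roc_auc_score(g["y"], g["p"]) if g["y"].nunique() > 1 else float("nan")
        out.append(row)
    return pd.DataFrame(out)

print(segment_metrics(scored, "PAY_0"))
print(segment_metrics(scored, "SEX"))
```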
We can do that for predictions and, if you’re patient or have a small enough data set, you can do it for log loss. You can do it on the loss, or error, of the model. And so I can actually see in this case that pay three and pay two, the third most recent and second most recent repayment statuses, are more important to the errors, to the model loss, than they are to the model predictions. What does that even mean? So I think you’d probably want to consider dropping those variables and doing some tests around that. And you just never would have seen this looking at test AUC. And you can do this with the SHAP package in Python; I’m sure I used XGBoost here. Okay, another oldie-but-goodie trick is to model the residuals. Here I fit a decision tree model to my residuals for the group of people who did default. It’s a very accurate single decision tree, an R squared of 0.88 and a very low mean squared error. And this tree encodes when I’m likely to be wrong.
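A minimal sketch of that trick, with the same hypothetical model and data as before; the residual definition here (per-row log loss) and the tree depth are assumptions.

```python
# Hedged sketch: fit a single, shallow decision tree to per-row residuals for
# the people who did default, then read off the paths where the model tends
# to be wrong.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

p = np.clip(model.predict_proba(X_test)[:, 1], 1e-6, 1 - 1e-6)
resid = -(y_test * np.log(p) + (1 - y_test) * np.log(1 - p))   # per-row log loss as the "residual"

defaulters = y_test == 1
tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(X_test[defaulters], resid[defaulters])
print("R^2 on defaulters:", round(tree.score(X_test[defaulters], resid[defaulters]), 2))
print(export_text(tree, feature_names=list(X_test.columns)))

# (The Shapley-contributions-to-log-loss idea above can be approximated with
#  shap.TreeExplainer(model, background, feature_perturbation="interventional",
#  model_output="log_loss") and explainer.shap_values(X_test, y_test);
#  exact arguments vary by shap version.)
```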
So if I can see somebody going down one of these paths in real time, I could potentially take action, what’s known as an assertion, to deal with this. Moreover, I could study this tree and understand my common failure modes and either try to fix them on the back end or plan for those kinds of failures when the model is deployed. Okay, so sensitivity analysis, residual analysis. Another big tool in our toolbox for model debugging is benchmark models. So, just trusted, hopefully interpretable, models that we can compare our more complex models against. It helps us make progress in training if I’m able to take small, reproducible steps away from a benchmark. That’s one of the only ways I know to convince myself that, when I make a six-line code change in a 100,000-line pipeline, I actually did anything better. How do I know I improved it? By comparing to the benchmark. That’s the only way I know, actually. And then, again, you can use it in a deployment setting.
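A minimal sketch of that deployment-time idea: score with a simple, trusted benchmark alongside the complex model, and flag rows where they disagree badly. The benchmark choice and the disagreement threshold below are assumptions.

```python
# Hedged sketch: score rows with both a trusted benchmark (logistic regression)
# and the complex model, and hold out predictions that disagree too much.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

benchmark = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression(max_iter=1000))
benchmark.fit(X_train, y_train)

p_bench = benchmark.predict_proba(X_test)[:, 1]
p_gbm = model.predict_proba(X_test)[:, 1]

flag = np.abs(p_gbm - p_bench) > 0.3               # hypothetical disagreement threshold
print(f"{int(flag.sum())} of {len(flag)} predictions flagged for human review")
```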
You can use benchmarking to help with training, and you can use benchmarking to help with deployment. You can score people on a trusted benchmark and you can score them on your very complex machine learning system. If the predictions are hugely different, you might want to pause before you issue that prediction. Especially on the kind of structured data, old-school data mining problems I work on, sorry, but machine learning just doesn’t change that much away from linear models. And if you see a big difference between a linear model prediction and a machine learning model prediction, that could be a clue that something bad is going on. You might want to stop before you issue that prediction and have somebody take a look at it. It could be evidence of a hack. All right, so what can we do to fix that straw man, my g mono straw man, my monotonically constrained gradient boosting model? So I think the number one thing we could do is just collect better data.
And I think that goes for many machine learning projects. We can use experimental design to collect the data that we actually need, as opposed to using data exhaust from other organizational processes to try to answer a difficult question. The model put too much emphasis on pay zero. We could have gotten better data, we could have done some feature engineering to try to draw that importance off pay zero a little bit, or we could have done some very strong regularization, like L0 regularization or missing value injection on pay zero, to try to decrease the importance of pay zero and spread the importance across other variables, so I wasn’t just operating with a single-business-rule type of model. Again, the sparsity problem: we needed more data, we needed more people who defaulted in the data, to make better decisions about those people.
It’s a very common problem, but one that we shouldn’t just brush off. Maybe we could have increased observation weights for those people above pay zero equals one. Maybe that would’ve helped a little bit, not sure. Then there’s this logical error where we were defaulting people whose payments were theoretically above their credit limit. We’re going to talk about model assertions a little bit more as a debugging fix, but essentially an assertion is a business rule run in real time to correct for logical errors like that in my model. That’s a practice that’s been done for decades in predictive modeling, using business rules to correct wrong outputs of mathematical functions, so we can keep doing that.
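Here is a minimal sketch of a model assertion of that flavor, written as a real-time business rule wrapped around the scoring call; the specific rule, the cutoff, and the override behavior are hypothetical illustrations, not recommendations.

```python
# Hedged sketch: a model assertion as a real-time business rule that catches a
# known logical failure mode and routes it for review instead of returning the
# raw model output. Assumes `model`, `X_train`, `X_test` from earlier sketches.
def score_with_assertions(row, cutoff=0.26):
    """row: one-row DataFrame with the same columns as X_train (hypothetical)."""
    p = float(model.predict_proba(row[X_train.columns])[0, 1])
    decision = "default" if p >= cutoff else "no default"

    # Assertion: a customer who just paid more than their whole credit limit
    # should not be auto-flagged as a defaulter on the model's say-so.
    if decision == "default" and float(row["PAY_AMT1"].iloc[0]) > float(row["LIMIT_BAL"].iloc[0]):
        return {"decision": "needs human review", "probability": p, "assertion": "prepayment_rule"}
    return {"decision": decision, "probability": p, "assertion": None}

print(score_with_assertions(X_test.iloc[[0]]))
```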
This model also had some discrimination issues, potentially at least, and we could have done a better job picking a model with high accuracy and also low discrimination by intentionally selecting a model on both accuracy and discrimination measurements. There are these notions popular in the academic literature of pre-, in-, and post-processing. I think those are great for the broader economy, but you have to be really, really careful with them in spaces like credit lending and employment, where predictive modeling has been regulated for years, and some of these techniques may actually violate existing regulations. So you want to be really careful with these in regulated industries. The most conservative thing to do is to try to pick a model that just balances accuracy and fairness. For all these security vulnerabilities that we’ve talked through, some basic things help a lot. API throttling: not letting people get as many predictions as they want, as fast as they want, out of your API. I know that may sound a little bit strange, because fast predictions are why we make these APIs, but if people are not authenticating to your API, then they could be stealing your model and your data. So throttling, slowing people down on the predictions, making sure people authenticate, and then monitoring the models for drift and anomalies.
That stomps down on a lot of those security issues. There’s more sophisticated things that can be done, for sure, but those three help a lot. And then, like I said, we could have evaluated dropping pay two and pay three because when we looked at it they were actually more important to the log loss than they were to the model predictions. So, again, never would have seen that just looking at test AUC. All right, we’ve talked about a lot of these throughout the presentation, so I’ll try not to harp on them, but there are things that we can do outside of the context of math to fix our models. And I brought up this notion of give people the ability to appeal inevitable wrong decisions. It will make your users and/or the general public feel much better about your machine learning models. To do that, though, takes transparency into how the system works. It’s very hard to appeal a black box decision.
People need to know how the decision was made and what data inputs were used, and then they can think about how to appeal a wrong decision. Again, there are lots of things we can do, like red teaming, paying experts to come in and see if they can break your code. That’s done very commonly in the cyber and computer security realm. Why aren’t we doing that in machine learning? I don’t know. Bug bounties, same idea. Companies pay rewards to people who find bugs in their code. Why aren’t we offering that for machine learning? I don’t know. Then there’s this notion of increased demographic and professional diversity on machine learning teams, a very hard nut to crack. Some companies are doing better than others, but the basic justification here is that diverse teams spot different kinds of problems. That’s been my personal experience. And so I think that this is an actual quality issue: if everyone on the team brings the same perspective to development, deployment, and testing, you’re going to be blind to a lot of different kinds of problems, particularly on the algorithmic discrimination side.
Domain expertise, having people understand the business, is super critical both in training and in testing. People forget about the testing part. You want domain experts to be helping you structure your testing of your system to make sure that that testing is realistic. I’ve highlighted incident response plans here. This is probably the simplest, most direct, most impactful thing you can do. Have an incident response plan. Complex systems fail. It’s a well understood phenomenon. Like I said, most companies already know how to do incident response plans, they’re just not doing it for AI. It’s time to stop that, time to acknowledge these systems can fail, and just be ready for that. You can prevent a small glitch that only a few people noticed from spiraling into a major problem that gets the eyes of journalists and regulators if you have an incident response plan. And you can do less harm if you have an incident response plan. We’ve talked about the IT governance and quality assurance. Just treat machine learning systems like other software in your company and apply the same good policies to those systems that you do to everything else.
We talked about model risk management. Go read that paper, SR 11-7. It’s great, and it’ll help you a lot. And then we talked a little bit about this, too: past known incidents. Look at known failures. There are lists all over the internet. I have one, and the Partnership on AI has a much better one, the AI Incident Database. Try to look at these past incidents and don’t repeat them. It’s what we did for airplanes and many other kinds of transportation. It’s important. All right, so just in general, what can we do to solve technical problems? Anomaly detection: just monitoring models for input and output anomalies, trying to suppress those as they happen or prevent them before they do happen. Calibration to past data: just because you run a bunch of data through a bunch of neural network weights and a number that’s been softmaxed between zero and one comes out the end, that doesn’t make it a probability.
And so there’s another step, calibrating your model to actual past outcomes, that we very rarely take in machine learning, and it makes our probabilities much more meaningful and probably our models much less error prone in the real world. So I would really think about calibrating to known past outcomes. There’s a whole science of experimental design, about selecting the data that we need to address the hypotheses that we’re setting up when we build these machine learning models. We should be using experimental design much more than we are. Interpretable models and explainable AI: quite simply, it’s way easier to mitigate and understand risks that we can see than ones that we can’t. It’s not impossible to control risk for black boxes, it’s just harder. I brought up this issue of manual prediction limits earlier, but why don’t we spend the time to test the bad things that machine learning models can do, the stupid, silly mistakes they can make, and just set manual limits to prevent those? I think that’s a great idea.
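Going back to calibration for a second, here is a minimal sketch of checking a model's scores against actual past outcomes and then fitting a simple isotonic calibrator on validation data; scikit-learn's CalibratedClassifierCV is another route, with details that vary by version.

```python
# Hedged sketch: check calibration against observed outcomes, then fit a simple
# isotonic mapping on validation data so the scores behave more like probabilities.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.isotonic import IsotonicRegression

p_valid = model.predict_proba(X_valid)[:, 1]
frac_pos, mean_pred = calibration_curve(y_valid, p_valid, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("mean predicted probability"); plt.ylabel("observed default rate")
plt.legend(); plt.show()

calibrator = IsotonicRegression(out_of_bounds="clip").fit(p_valid, y_valid)
p_test_calibrated = calibrator.predict(model.predict_proba(X_test)[:, 1])
```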
There’s this new notion of model editing, where we directly edit the code or the functional form of our model to correct mistakes. I think that’s great as long as you’re careful with it and you know what you’re doing. We’ve talked about model monitoring enough, and we’ve talked about the monotonicity constraints quite a bit. I’ll say that XGBoost, and probably other libraries, now also supports interaction constraints, which can be particularly important in discrimination remediation. If I know two variables interact with each other to create a discriminatory effect, I can prevent that explicitly. And we’ve talked about strong regularization and this notion of injecting missing values, or other ways to corrupt the data, to deemphasize variables that are causing our model to go wrong. So I’ll leave it at that for now for these technical remediation strategies, but if we find bugs, there’s lots that we can do.
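For what it's worth, here is a minimal sketch of interaction constraints in XGBoost, under the same hypothetical feature list as the earlier training snippet; the groupings themselves are made up for illustration, and the accepted format differs a bit across XGBoost versions.

```python
# Hedged sketch: interaction constraints in XGBoost. Features may only interact
# with other features in the same group; here the repayment-status features are
# walled off from the dollar-amount features.
import xgboost as xgb

feature_groups = [
    ["PAY_0", "PAY_2", "PAY_3"],             # repayment statuses may interact with each other
    ["BILL_AMT1", "PAY_AMT1", "LIMIT_BAL"],  # dollar amounts may interact with each other
]
groups_as_indices = [[list(X_train.columns).index(f) for f in g] for g in feature_groups]

constrained = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    monotone_constraints=mono,                   # reusing the constraints from the earlier sketch
    interaction_constraints=groups_as_indices,   # exact format varies by XGBoost version
)
constrained.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
```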
All right, just a few references and resources to close. These are papers that I really suggest reading if you haven’t. So this is the AI Incident Database paper. I think the title says it all: preventing repeated real-world AI failures by cataloging incidents. Don’t make the same mistake that another company made, especially when it’s public and you could just look at it and not do it again. These two papers talk about fundamental problems. The first is on underspecification, this notion that machine learning models are just highly complex entities that we have to constrain with our domain knowledge or, sadly, they don’t really work. That’s a paper with 40 Google researchers’ names on it, and that’s in a nutshell what it says, so I would urge you to take a look at it. The other, a Proceedings of the National Academy of Sciences paper, just says that some things aren’t predictable. No matter how much data we have and no matter how sophisticated our models are, there are things, about people especially, that we just can’t predict.
So I’d urge you to have a look at that. And then, I brought up model risk management many, many times, but I would, again, just highlight that this is maybe the best paper on machine learning risk ever written. So go have a look at it if you haven’t seen it before. Here are some additional resources out there on the web for model debugging, and some tools that I find interesting or useful, down here at the bottom. And that’s all. I really thank you for your attention today, and please feel free to reach out if you found this interesting or if you have any questions. Thanks a lot for your time.

Patrick Hall

Patrick Hall is principal scientist at bnh.ai, a D.C.-based law firm specializing in AI and data analytics. Patrick also serves as visiting faculty at the George Washington University School of Busine...