EPISODE 3

3 T’s to Securing AI Systems: Tests, tests, and more tests

What does it mean to make your machine learning system “production-ready”? Yaron Singer walks us through the infrastructure, testing procedures, and more that help make ML systems ready for the real world in this episode of Data Brew.

Yaron Singer
Yaron Singer is the CEO and co-founder of Robust Intelligence, and the Gordon McKay Professor of Computer Science and Applied Mathematics at Harvard University. Before Harvard he was a researcher at Google and obtained his PhD from UC Berkeley. He is the recipient of the NSF CAREER award, the Sloan fellowship, the Facebook faculty award, the Google faculty award, the 2012 Best Student Paper Award at the ACM conference on Web Search and Data Mining, the 2010 Facebook Graduate Fellowship, and the 2009 Microsoft Research PhD Fellowship.

Video Transcript

The Beans, Pre-Brewing

Denny Lee (00:06):
Welcome to Data Brew by Databricks with Denny and Brooke. The series allows us to explore various topics in the data and AI community. Whether we’re talking about data engineering or data science, we’ll interview subject matter experts to dive deeper into these topics. And while we’re at it, we’re going to enjoy a morning or afternoon brew. I’m Denny Lee, I’m a developer advocate at Databricks and one half of Data Brew.

Brooke Wenig (00:30):
And I’m Brooke Wenig, machine learning practice lead at Databricks and the other half of Data Brew. For this episode, we’d like to introduce Yaron Singer, CEO of Robust Intelligence and professor of computer science at Harvard University. In this session, we’ll be discussing how to secure AI systems and everything that entails. But before we dive into that, I would love it if Yaron could introduce himself and how he got into the field of machine learning.

Yaron Singer (00:52):
Great. Well, hi Denny, hi Brooke. Thanks for having me here. Yeah, my adventures into machine learning started some time ago. I was really interested in computer science and I was really interested in algorithms and I got my start in more of theoretical computer science, developing efficient algorithms. For me, something that has always been really, really fascinating throughout this journey is the real world and real data. Something that I’ve always been really invested in is understanding how the algorithms that we develop, how they actually interact with the real world and real data.

Yaron Singer (01:35):
I think this is really how I got exposed to machine learning where basically when you really want to understand this, then you realize all these statistical properties in the world that come up and all the possibilities and the things that you can do and these fascinating questions that arise.

Yaron Singer (02:00):
Yeah, so that’s how I got started into it. I did my PhD at Berkeley on this topic and yeah, then went from there.

Denny Lee (02:13):
How then did you go from machine learning into basically talking about the inherent operational risks of AI systems? And can you introduce that concept for that matter?

Yaron Singer (02:23):
Yeah, absolutely. All right, maybe I’ll introduce the concept first and then I’ll go into how I got into it. The concept of securing machine learning basically refers to an inherent problem that we have: machine learning is extremely sensitive to very, very small changes in the underlying data. And these small changes could be due to various reasons. It could be due to the fact that our data is different from data that we’ve seen in the past, or our training data has not been properly collected. Maybe there’s human error, and maybe there is an adversarial player that’s injecting bad data into it.

Yaron Singer (03:08):
All these reasons basically create these very, very small changes in the data that can really, really fool our state-of-the-art AI models. That’s when we’re talking about securing AI or secure machine learning, this is what we mean. We mean the ability to make sure that the data that’s coming into the models is not going to have an adverse effect on them.

Yaron Singer (03:32):
How I got into it is basically, I think it’s been a while ago and back in the day I had my own little startup adventure and then later on I went to work for Google and I saw a lot of this interaction between machine learning and algorithms. Specifically what you see when you’re working on these systems, if you’re thinking about any machine learning system that you’re actually working on, what you quickly notice is how these very, very small changes in the data can really affect your decisions.

Yaron Singer (04:19):
What I found out when I was working at these places was that we were spending all this energy on developing these really, really smart algorithms. You can think about algorithms for things like marketing and social networks. You can think about algorithms for problems like AdWords. But what all these algorithms depend on is their input: they take machine learning predictions. When you see how sensitive these machine learning predictions are and how they change all your hard work in optimization and how they completely change your decisions, you realize that the problem that you really should be focusing on is understanding the sensitivity of machine learning models to these small changes in the observations.

Yaron Singer (05:08):
Basically yeah, my entire career at Harvard has been on this. I’ve been working on this at Harvard for probably seven or eight years on exactly understanding how to develop noise robust machine learning algorithms. So there we go.

Brooke Wenig (05:28):
Yeah, I know that’s a very big problem. I mean, you read all these papers, one pixel attacks, one pixel can completely change the output of the classifier. Can you talk a little bit more about your work and how you’re actually attacking this problem?

Yaron Singer (05:44):
Yeah. I mean, yes, absolutely. I think at Robust Intelligence, there are two sides of this. The first side is attacking machine learning models, and by attacking this could be by one pixel changes, but it could also be by sending an image with the wrong dimensions to a model or sending a data point that has a missing feature. We have a very deep understanding about what fools machine learning models and why.

Yaron Singer (06:16):
Then the other side of it, obviously, is that we’re building an AI firewall, and what that AI firewall does is handle exactly that. Basically, you can think about this as a wrapper around your existing model. It has the ability to give you a confidence score about the data point coming in: the likelihood of that data point fooling your model.
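The wrapper idea described here can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not a description of Robust Intelligence's product: the class `GuardedModel`, the `risk_score` callable, the `threshold` parameter, and the toy model are all hypothetical names invented for the example.

```python
class GuardedModel:
    """Hypothetical wrapper: score each incoming data point before the model sees it."""

    def __init__(self, model, risk_score, threshold=0.5):
        self.model = model            # any callable: row -> prediction
        self.risk_score = risk_score  # any callable: row -> float in [0, 1]
        self.threshold = threshold    # assumed cut-off for flagging a row

    def predict(self, row):
        risk = self.risk_score(row)
        if risk >= self.threshold:
            # Do not call the model on a row that looks likely to fool it.
            return {"prediction": None, "risk": risk, "flagged": True}
        return {"prediction": self.model(row), "risk": risk, "flagged": False}


def toy_model(row):
    # Stand-in for a real model.
    return "adult" if row["age"] >= 18 else "minor"


def toy_risk_score(row):
    # Stand-in for a learned score: 1.0 for a missing or implausible age, else 0.0.
    age = row.get("age")
    if not isinstance(age, (int, float)) or not 0 <= age <= 120:
        return 1.0
    return 0.0


guarded = GuardedModel(toy_model, toy_risk_score)
print(guarded.predict({"age": 30}))
print(guarded.predict({"age": 1_000_000}))
```

In this sketch a flagged row never reaches the model; a real system might instead log it, fall back to a default, or route it to a person.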

Denny Lee (06:49):
Got you. One of the things that’s related to this that I want to think a little bit about, is that most data validation checks, if any, are in the pre-processing logic. We have, for example, is it the correct data type? What issues do you basically see with this type of approach?

Yaron Singer (07:08):
Yeah, I mean, that’s a great question. All right, first of all, what do we mean when we talk about pre-processing? Let’s take neural networks as an example. Normally, data that goes into what we think about as the neural network or the machine learning model first goes through a pre-processing phase. That pre-processing phase largely takes a raw data point and transforms it into what are normally numerical features that can be used by the neural network or our modern machine learning model.

Yaron Singer (07:54):
That pre-processing box, what it does is it does this translation. In addition to this translation, it does some very basic input validation. That’s something that is obviously very, very good. It could do things that are important like if you’re passing on a value that is maybe a very large integer that can create integer overflow or something like that, then it can handle that. Or maybe if you’re passing an integer inside a string, it can basically translate that string into an integer without problem. It can basically do those things.
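The kind of basic input validation mentioned here might look like the following. It is a minimal sketch, assuming a feature that must fit in a signed 64-bit integer; the function name `parse_int_feature` and the 64-bit bound are assumptions made for illustration.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def parse_int_feature(raw):
    """Convert a raw value to an int that fits in 64 bits, or raise ValueError."""
    if isinstance(raw, bool):
        # bool is a subclass of int in Python; treat it as a wrong type here.
        raise ValueError("expected an integer, got a boolean")
    if isinstance(raw, int):
        value = raw
    elif isinstance(raw, str):
        value = int(raw.strip())  # raises ValueError for non-numeric text
    else:
        raise ValueError(f"unsupported type: {type(raw).__name__}")
    if not INT64_MIN <= value <= INT64_MAX:
        raise ValueError("value would overflow a 64-bit integer")
    return value


print(parse_int_feature(" 42 "))     # an integer passed inside a string
print(parse_int_feature("1000000"))  # accepted: this layer has no idea what the feature means
```

The check is purely about type and representation. It accepts any well-formed integer, whatever the feature is supposed to represent.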

Yaron Singer (08:32):
But the thing is that it doesn’t have an understanding of the model and it doesn’t have an understanding of the data. For example, pre-processing doesn’t know if there’s a categorical value that the model hasn’t seen. It doesn’t know that if one of your features is age and you feed in age one million, well, that doesn’t make sense. But in order to understand that it doesn’t make sense, you need to understand the distribution that the data is coming from and understand that, well, one million doesn’t really make sense.

Yaron Singer (09:09):
There are a gazillion more very nuanced things that pre-processing just doesn’t do. And not only does it not do them right now, but in order to do that, what you actually need to do is build a system that is trained on the model and the data and puts those together and then has context about what are the right data points and what are the data points that fool the model. And what are the data points that maybe are incorrect, but are not going to affect the model. You also want to know about that. You don’t want to have these false positives and that’s why you need a firewall.
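A check that has context about the data, as opposed to only its types, can be sketched as follows. This is a deliberately crude illustration: it uses the minimum and maximum seen in training as a stand-in for "understanding the distribution", and it makes no attempt to decide whether an unusual value would actually affect the model, which is the false-positive problem raised above. The function names and the example features are hypothetical.

```python
def fit_profile(training_rows):
    """Record, per feature, the numeric range or the set of categories seen in training."""
    profile = {}
    for row in training_rows:
        for name, value in row.items():
            if isinstance(value, (int, float)):
                lo, hi = profile.get(name, (value, value))
                profile[name] = (min(lo, value), max(hi, value))
            else:
                profile.setdefault(name, set()).add(value)
    return profile


def check_row(profile, row):
    """Return a list of human-readable issues; an empty list means nothing looked odd."""
    issues = []
    for name, seen in profile.items():
        value = row.get(name)
        if value is None:
            issues.append(f"{name}: missing")
        elif isinstance(seen, set):
            if value not in seen:
                issues.append(f"{name}: unseen category {value!r}")
        else:
            lo, hi = seen
            if not isinstance(value, (int, float)) or not lo <= value <= hi:
                issues.append(f"{name}: {value!r} outside training range [{lo}, {hi}]")
    return issues


training = [{"age": 23, "plan": "basic"}, {"age": 61, "plan": "pro"}]
profile = fit_profile(training)
print(check_row(profile, {"age": 1_000_000, "plan": "basic"}))
print(check_row(profile, {"age": 40, "plan": "enterprise"}))
```

The sketch assumes each feature is consistently numeric or consistently categorical across the training rows.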

Brooke Wenig (09:49):
This actually transitions really nicely into the concept of testing machine learning applications, because generally software engineering people talk about test driven development or test first development. With data science I’m happy if somebody even writes a test. What are your thoughts about writing tests for machine learning applications?

Yaron Singer (10:06):
Yeah, many. I think that this is somewhere where we as a field, machine learning as a field, have a long way to go. By the way, there was a beautiful talk at NeurIPS, I think it was by [inaudible 00:10:33]. It was for the test of time award, and it was this really great talk about machine learning as an engineering discipline and all the ways that we still have to go in order to have machine learning be an engineering discipline that we trust in the same way that we trust other disciplines like maybe civil engineering or other engineering disciplines where we understand that things are at stake.

Yaron Singer (11:16):
Along those lines then, one of the things that can help us with developing rigorous machine learning systems is by testing. Testing our machine learning models, testing our data, things like that. Right now, it’s done in a very ad hoc way. I think about this a lot, why? In my business I talk to a lot of data scientists from a lot of different companies, from all ends of the spectrum. I really try to understand the culture, I really try to understand the practices and understand well, how can we make the organization not only have better AUC results, but how can we make this a more rigorous, better disciplined organization?

Yaron Singer (12:14):
I think that the reasons have to do with culture and maturity. If we’re comparing to software engineering, then I think the culture of machine learning is a little bit research-y. We really enjoy taking these statistical models, playing around with them, seeing some results. I think that’s been a driver of machine learning and there’s also a great deal of innovation and it’s constantly in flux and changing. So I think inherently in the culture, it feels less rigorous. I think that’s one part of it.

Yaron Singer (13:00):
The other part of it is maturity. If we think about machine learning as an engineering discipline, that’s relatively a new thing. When I was in grad school, the number of companies that had machine learning as part of their core business, was not a great deal. You’d use this for things like fraud detection and spam detection. These are very, very big companies, but it’s not that pretty much every software company that you talked to had a data science team. But now, 10 years later, all of a sudden the world is really changing. I think what we also have to understand is that we have to understand that as an engineering discipline, machine learning is a pretty young discipline.

Yaron Singer (13:44):
I think those two things together. We don’t follow the rigorous processes or in most cases, most companies, most organizations right now, they still are not following the rigorous processes that we’d expect in software development. Well, those things are changing. I think as people are developing AI models and they’re making decisions based on these AI models, and they understand the consequences and the risk involved, then I think those practices are changing.

Brooke Wenig (14:26):
Yeah. And that’s also a really interesting discussion just about the split between machine learning and engineering, because actually most of the people that get into data science and machine learning don’t come from a pure engineering background. Even computer science at many schools, isn’t considered engineering. Data scientists are entering it from math, from physics, from computer science, many other fields as well, social sciences. In those fields, the concept of test driven development isn’t often taught in courses. But do you think it’s possible to do test driven development for data science, or do you think that models evolve and change so much that it’s too difficult to write the tests up front?

Yaron Singer (15:06):
Oh, yes. I absolutely think that a test driven approach in data science is possible. Yes, absolutely, I do. It’s harder. Let’s recognize the fact that this is harder, but it’s definitely doable. Now, why is it harder? It’s harder because when we’re writing software, normally in most cases, it’s more deterministic. We know how to break it up, it’s interpretable, we can look at the code, we understand it. We understand what every component needs to do and it’s easier. Then we can take it by chunks. It’s easier for us to test software and create these unit tests.

Yaron Singer (15:50):
When we’re dealing with AI models, in many cases we’re using the models off the shelf, because it’s difficult to train them. Maybe you don’t have the data, maybe you don’t have the resources and what not. Now we’re interacting with this black box. And this black box, it could be not deterministic, it’s a statistical entity. The universe of possible unit tests seems infinite and that’s really hard. I don’t understand this box that I now need to test and what am I testing for? What are all the things that can happen?

Yaron Singer (16:38):
Writing tests actually requires you to now develop algorithms that will fool this black box entity. So it’s definitely harder, but is it doable? Yes, it’s doable. If you spend the time and the effort to do it, then you can do it and you can test your model and you can basically expose all its quirks, all its vulnerabilities, things like that.
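One simple form such a test could take is a search for a small perturbation that changes the prediction of a black-box model. The sketch below uses plain random search on a single numeric feature; it is an illustration of the idea only, and the names `find_flip`, `toy_model`, and the `income` feature are hypothetical. Real adversarial testing methods are considerably more sophisticated.

```python
import random


def find_flip(model, row, feature, max_delta, trials=200, seed=0):
    """Randomly search for a small change to one feature that flips the prediction.

    Returns the perturbed row if one is found, otherwise None.
    The model is treated as a black box: only its outputs are inspected.
    """
    rng = random.Random(seed)  # seeded so the test is repeatable
    baseline = model(row)
    for _ in range(trials):
        delta = rng.uniform(-max_delta, max_delta)
        candidate = {**row, feature: row[feature] + delta}
        if model(candidate) != baseline:
            return candidate
    return None


def toy_model(row):
    # Stand-in for a black-box model with a sharp decision boundary.
    return "approve" if row["income"] >= 50_000 else "reject"


# Far from the boundary: no perturbation of up to 50 can change the decision.
print(find_flip(toy_model, {"income": 80_000.0}, "income", max_delta=50))

# Close to the boundary: a tiny change is enough to flip the decision.
print(find_flip(toy_model, {"income": 50_010.0}, "income", max_delta=50))
```

A test suite could then assert that `find_flip` returns `None` for inputs where the decision is expected to be stable.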

Yaron Singer (17:13):
At Robust Intelligence, we think that this is a big ask for data science teams. We know how long it takes to develop these tests in a thoughtful, rigorous way. What we anticipate organizations to do, we anticipate them to use products that do that rather than sending their data science teams for six months or whatnot to develop these unit tests.

Denny Lee (17:41):
No, that makes a ton of sense. In fact, actually you’re reminding me, now I’m pulling a little bit of my own past here. When we tried to even just do validation tests on BI systems, at the time that we started it was like, “Oh, but all the Cartesian product of all possible queries.” Then of course over time you develop that rigor of like, “Oh yeah, well, we’re actually doing validation tests, so you don’t actually have to get every single value. You just need to have tests that are representative of that.” Exactly to your point, as we mature, we’ll start realizing, okay, maybe you don’t need to have every permutation out there.

Denny Lee (18:19):
But that does relate to actually my next question, which is well then, especially because you’re doing this a little bit over at Robust Intelligence and we’ll actually want to actually have you describe a little bit more about your company shortly, but what are some of those biggest challenges that you see people face when they’re trying to get these models production ready?

Denny Lee (18:37):
I mean, obviously we’ve alluded to that from the testing perspective. I’m just curious, are there other aspects that you feel that should be called out right from the get go?

Yaron Singer (18:46):
Well, when you’re looking at this more globally, a lot of times we’re interacting with the VP of the organization, the VP of data science, or someone who’s at the director level. One of the biggest challenges that they have is with having an understanding of what their models are doing. They want to have some sort of oversight, they want to have some sense of an inventory of the models. They want to understand what are the different things that the team has tried and tested and why they’re using what they’re using. They want to have an understanding. They want basically some sort of visibility, some kind of observation on I think the quality of the models as a whole.

Yaron Singer (19:45):
I think that’s a big challenge. And then the other thing, very much related to that, is standardization of practices and quality. You don’t want one team developing these types of tests and looking out for these cases and another team looking out for something else. You want some standardization. I think the other aspect of it that’s related is about understanding what assumptions the model’s making. Actually, this is a super important thing especially in larger organizations. What you see in a lot of these large organizations, is somebody develops a model, there are all these assumptions baked into it. For example, like, “Oh, I’m assuming that pre-processing does this. And I’m assuming that the data’s never capitalized. I’m assuming that this and this and this and this and this and this.” There’s no document that ever gets written about this. And even if there were, nobody would read it.

Yaron Singer (21:01):
Now that person hands off the model to someone else. That gets used in production. Nobody knows what are the assumptions this model’s making. Now all these things are fed into it, it breaks and then you have to call in that person to come in, fix, do all this firefighting. This understanding the assumptions that the model’s making, especially as these models, they move around the organization from one team to another, that is actually something that’s really, really critical. That’s a point where a lot of things break. So, yeah.

Brooke Wenig (21:44):
In terms of communicating these pre-processing steps, I see this happen quite a lot with our customers. They’ll have a data engineering team, they’ll ingest the data, and they’ll do some very basic pre-processing. They think they’re helping the data scientists out by either dropping the missing values, or just blindly imputing it with the mean. How do you suggest best communicating these pre-processing steps? Like you’d said, it’s very difficult to communicate, especially if the responsibility transfers across teams or the code evolves through many people working on it. How do you internally keep track of it? Or how do you see your customers keeping track of what pre-processing steps are actually required by the model?

Yaron Singer (22:17):
Yeah, I mean, I think this is exactly like testing. We can’t stress this more. Run models through tests, you test the models and then according to these tests, these tests expose exactly automatically, they expose what the model is assuming. Is a model assuming that now this input is a numerical feature, is this model assuming that this is a unique ID? All these things, they have to become exposed in some automated way. What we’re recommending is, either buy or build just some validation method that tests the model and exposes exactly what the model is assuming.

Yaron Singer (23:12):
Before you’re going to put your model into production, you look through these assumptions and then you see whether that suits the input that the model is now going to take. This is what we’re recommending.
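A validation method of this kind can be approximated by mutating a known-good input and recording how the model reacts. The following is a minimal sketch under stated assumptions: `probe_assumptions`, the three mutations, and the toy model are hypothetical, chosen to mirror the examples in the conversation (a feature assumed numeric, a string assumed never capitalized, a value assumed present).

```python
def probe_assumptions(model, row):
    """Mutate one feature at a time and record whether the model errors or changes its output."""
    baseline = model(row)
    findings = {}
    for name, value in row.items():
        mutations = {"missing": None}
        if isinstance(value, str):
            mutations["uppercased"] = value.upper()
        if isinstance(value, (int, float)):
            mutations["as_string"] = str(value)
        for label, mutated in mutations.items():
            candidate = {**row, name: mutated}
            try:
                outcome = "same" if model(candidate) == baseline else "changed"
            except Exception as exc:  # any failure is itself a finding
                outcome = f"error: {type(exc).__name__}"
            findings[(name, label)] = outcome
    return findings


def toy_model(row):
    # Stand-in model with unwritten assumptions: age is numeric, country is lowercase.
    score = row["age"] * 0.01
    if row["country"] == "us":
        score += 0.5
    return score > 0.6


for key, outcome in probe_assumptions(toy_model, {"age": 30, "country": "us"}).items():
    print(key, "->", outcome)
```

Each "error" or "changed" entry points at an assumption worth writing down and comparing against what the production inputs will actually look like.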

Brooke Wenig (23:35):
Got it. I know that the vast majority of data science models never end up making it to production. I think the number’s around 90-ish percent. What are some of the main reasons why you see many models not actually make it into production?

Yaron Singer (23:49):
Yeah, I think that’s a good question. I think what we also need to ask ourselves is, “What do we mean when we say production?” I think so many customers that we talk to, they’re like, “Oh, well, my model doesn’t go into production, so something, something, something.” I think that our idea of production is normally we’re thinking about a model that’s sitting in some very dark room on a server and it’s taking in data and making decisions autonomously and just running like that.

Yaron Singer (24:30):
Well, in a lot of cases you find companies, they don’t have a model in production, but what they do is on a weekly basis, they run a model on the new data set and they send that to their customers. In some cases they have this as an automated process and in some cases they retrain the model. And they don’t think about it as production because they think, “Oh, well, there’s always maybe a human there.” Then maybe in some cases someone who is also validating the results.

Yaron Singer (25:06):
But I think these are still cases that are interesting and still cases where the results of the models are being consumed by people who are making decisions. I think if we’re thinking like, “Why don’t we see more models get into production in the way that we think about them,” I think it’s just a matter of time. I think that again, we have to remember just how new the discipline is. If we think about this five years from now, are we going to see more automated decisions getting made by models without human intervention? Absolutely, we are.

Yaron Singer (25:47):
And by the way, Databricks is really pushing the envelope on this. We see with Databricks Notebooks and the way they make it so easy for people to automatically retrain their models, then two years from now, five years from now, if now the numbers are less than 10% of models getting into production, then I think that we’re going to see 50% of models in production in the way that we’re thinking about production these days.

Denny Lee (26:13):
This thought was really interesting. As you’re noting the fact that you were segueing to automation, and then just like you said, there’s 50% of the models are going to go into production as more and more going in. But then the one thing that I’m just curious about is because of that automation and you alluded to this before, does this imply that we actually now also need to really look out for the biases that we place in our AI models? Because whether it’s the pre-processing steps, the fact is that we’re automating these things. Inherently it’s a human that’s writing it down, so their inherent biases may in fact end up being automated. I’m just curious from your perspective.

Yaron Singer (26:59):
Oh, yeah. I think that’s a great question. I think absolutely we do. One thing that we have to realize is the fact that AI is in the world and the fact that we’re going towards a world where AI is responsible for decisions being made automatically without human intervention, that’s a fact. And we also have to understand, I don’t want to geek out too much, but if you give me 10 seconds, you allow me. If we think about when you stare into the eyes of machine learning, when you understand why it works, when you understand by the way, the brilliance of this, the brilliance of this path vulnerability and all this, it’s something that’s supposed to work well on average by definition. It can work really well on average and we can have really great confidence in our goals and it can do all those great things when we have a lot of data and all this.

Yaron Singer (28:09):
But by definition, these things are not supposed to handle the outliers and the worst case instances. Whenever we’re thinking about bad decisions that we don’t want to be made by these AI models, it is the corner cases. Whether it’s a corner case because of malice, that somebody is taking advantage, as we saw with the Microsoft chatbot a few years ago, if you guys remember. When people were poisoning it to make it say racist slurs and all that. I think that was a good exercise for humanity to understand how these things can go wrong in a matter of just hours. Of course we’ve seen other examples with Google and then the image recognition mishaps, things like that.

Yaron Singer (29:16):
The bias is almost inherent in this approach. I think what it means is it calls for us to basically protect against these corner cases and then make sure that these corner cases are handled before they go into these statistical models that do well on average.

Brooke Wenig (29:42):
Yeah. I mean, that’s a really excellent point about machine learning systems. Even when you evaluate them, you evaluate them in aggregate, you don’t generally look at the individual error analysis of which records you misclassified.
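The gap between aggregate evaluation and per-record error analysis is easy to see with a small example. The sketch below uses made-up evaluation records; the field names `segment`, `predicted`, and `actual` are assumptions for illustration.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Fraction of records where the prediction matches the label."""
    return sum(r["predicted"] == r["actual"] for r in records) / len(records)


def accuracy_by_slice(records, slice_key):
    """Accuracy computed separately for each value of slice_key."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for r in records:
        bucket = counts[r[slice_key]]
        bucket[0] += int(r["predicted"] == r["actual"])
        bucket[1] += 1
    return {name: correct / total for name, (correct, total) in counts.items()}


# Made-up data: 18 common cases predicted correctly, 2 rare cases predicted wrongly.
records = (
    [{"segment": "common", "predicted": 1, "actual": 1} for _ in range(18)]
    + [{"segment": "rare", "predicted": 1, "actual": 0} for _ in range(2)]
)

print(overall_accuracy(records))
print(accuracy_by_slice(records, "segment"))
```

Here the aggregate number looks healthy at 0.9 while the rare segment is wrong every time, which is the kind of corner-case failure an average hides.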

Yaron Singer (29:52):
Exactly, exactly. That’s exactly right. We don’t evaluate a machine learning model based on its worst prediction because if we do that and we train machine learning models to do well on their worst behavior, then we’re not going to go anywhere. Just anecdotally, there’s this example from the world of adversarial noise and all the work that has gone there, which has been really, really fantastic. But the approach there has been I think it was almost this thought experiment that went on for a few years about like, “Let’s do adversarial training.” Meaning we will take our existing machine learning methods for training models and what we’ll do is, we’ll just feed in all this adversarial data into them and then see the performance of the model.

Yaron Singer (30:55):
I think the best that we know how to do right now with respect to these image recognition models, is I think we have accuracy on the order of 37% and this is in comparison to what, 90%, 98%, 99% without this? It’s hard to imagine that someone is going to happily adopt a model that, instead of 99%, does 37% because it can handle these crazy corner cases. From everything that we’ve seen, that’s really inherent in machine learning, in the method that works. And again, if we’re going back to this as an engineering discipline, we can’t not fix this. So it’s on us.

Brooke Wenig (31:55):
Yeah, and I guess the inherent problem as well with corner cases is if you build a model that can handle all the corner cases, you’re likely overfitting to the corner cases you’ve already seen. So it’s a problem that’s just really difficult to solve. I’m curious to see what comes out in the future.

Brooke Wenig (32:10):
But as a wrap up here, two questions for you. One: what advice or best practices do you want to leave our listeners with today, and two: could you do a quick pitch of Robust Intelligence for everyone?

Yaron Singer (32:22):
Absolutely. I was hoping you were going to ask for my T-shirt size, but okay. Okay, that’s a wish. For the first question, yeah, my pitch to companies is in general, I think be rigorous in your practices. Insist on unit testing, insist on standardization, insist on visibility. Count the number of times that you’re firefighting and jumping into the models and realize that that’s not a good thing. Think about your KPIs: include not only accuracy of the models, but include the rigorousness of this and be well aware of all the consequences to the business. Not to say humanity, but to your business when you’re thinking about when these models fail. This is the advice that I have for organizations. And of course, use Robust Intelligence.

Yaron Singer (33:38):
What is Robust Intelligence? Yeah, what we’re doing at Robust Intelligence is we’re building an AI firewall. And what this AI firewall does, again, you can think about this as a wrapper on your model. You have a data point coming in and then this AI firewall gives you a quality score for the data, indicating the likelihood of this data point fooling your model. The way that we do this is, we do this in stages. The first thing that we do is we train this using stress testing. We stress test the model, we identify all the vulnerabilities, all the issues that it has with the data that it’s supposed to take in. Then based off of that, we’re training these firewalls.

Yaron Singer (34:24):
Yeah, for anyone out there that is interested in improving the rigor of your system, Robust Intelligence is waiting for you.

Brooke Wenig (34:34):
Sounds very robust. Just had to do that.

Yaron Singer (34:39):
Yes, yes. No, we get that all the time. That’s good, it’s good.

Brooke Wenig (34:43):
Well, thank you so much for joining us today on Data Brew, Yaron. We definitely enjoyed our conversation, learned a lot, and it’s good to know that test driven development is possible with machine learning as well.

Yaron Singer (34:52):
Amazing. Thank you guys. Really enjoyed it. Thanks.