A Fast Decision Rule Engine for Anomaly Detection

May 28, 2021 11:05 AM (PT)


Description: We present a supervised anomaly detection approach that is scalable and interpretable. It works with tabular data and searches over all decision rules for the anomaly class involving one or two features. It creates a classifier out of all rules meeting user-specified precision and recall constraints, classifying a test example as an anomaly if any of the rules fire. Overlapping decision rules can be pruned to reduce model complexity, leaving a small number of simple rules that a user can easily understand. Our system operates on Pandas DataFrames and has a high-performance C++ backend with experimental GPU and FPGA acceleration available. It is available open-source at https://github.com/jjthomas/rule_engine


In this session watch:
James Thomas, Student, Stanford University


Transcript

James Thomas: Hi, I’m James Thomas. And I’m going to talk about a fast decision rule engine for anomaly detection. Just as a brief overview of my presentation, we’re going to be introducing a classifier based on one and two feature decision rules as an interpretable approach for supervised anomaly detection. And even though this approach involves a lot of computation, we made it practical because of the fast implementation that we have. And then afterwards, I’m going to get a little more into the details of our pandas API and then show a demo as well. Just as an overview of supervised anomaly detection problems, they are generally binary classification problems with a large class imbalance. Essentially, in your training datasets you’re often going to have a very small number of actual anomalies, and a very large number of data points that are not anomalies at all. This can often cause normal ML methods to struggle, methods that are used to a more balanced dataset of ones and zeros in the binary classification setting.
And we want interpretability often in these problems because humans are involved in addressing the anomalies. Your anomaly detection system is going to say, “Hey, this data point is an anomaly. You need to go do something about it.” And the human would love to know, “Okay, why was this classified as an anomaly?” So they can better understand how to deal with it. So these are kind of the characteristics of these sorts of problems. Now let’s jump right into the approach we’re going to take here with decision rules for categorical tabular data. We’re going to be working with this kind of data and I’m going to show you, just in a little bit, how to extend that to continuous variables as well, but we’ll start with categorical variables for now. So consider this example where you have cell phone telemetry data.
You might have a few different features that you get from all the different cell phones that you have deployed around the world. For example, you might have the operating system version, you might know the manufacturer, the age of the device, the region the device is operating in. And in reality, you might have many more features than this, but for this example, we’ll just look at these four. And then your class variable here, whether or not you have an anomaly essentially, is this “has error” column. Was there an error during the phone’s operation recently? You can see that in this dataset we’ve got two anomalies and four non-anomalies. Let’s see if we can come up with a decision rule to classify our anomalies.
One potential decision rule that involves just a single feature is: the operating system version is equal to 4.1. So that’s going to retrieve the four rows that you see here highlighted in green. And it turns out that two of those four are anomalies. So we have a precision of 50%, as you can see at the bottom here, and we’ve returned four total examples. So this is an okay rule, giving us decent performance. Let’s see if we can do any better. A second rule is this one: manufacturer is equal to Samsung. That’s just returning one row. And it turns out that row has an error, so we’ve got a precision of one out of one, or a hundred percent. We only returned one example. So maybe we don’t think this rule could generalize because it’s kind of specific to just a single example, but it does give us high precision. All right. Now what about two feature rules? What do we mean by that?
Well, here’s an example of one of those. If the operating system version is 4.1 and the device age is equal to one, then we return three different rows, and it turns out two of them have errors. So that’s a precision of 66%, and we’ve returned three total examples. So this is quite a good rule, slightly higher precision than the first one, even though it returns fewer examples. That’s kind of an overview of the one and two feature decision rules I was talking about, used to return anomalies from the dataset. You can see pretty clearly that these rules are interpretable, especially when they’re limited to one or two features. If an analyst sees that, “Okay, the operating system version is 4.1 and the device age is one, and it turns out there’s a large number of anomalies,” they can kind of make heads or tails of this. They can see, “Oh, that operating system version might not work well with the newer devices.” It’s kind of easy to comprehend and come up with a story for this.
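To make this concrete, here is a minimal pandas sketch of scoring rules like these. The DataFrame below is a hypothetical reconstruction of the slide’s toy telemetry data (the actual values aren’t in the talk), and `rule_stats` is an illustrative helper, not part of the system’s API:

```python
import pandas as pd

# Hypothetical toy data matching the slide's counts: 2 anomalies, 4 non-anomalies.
df = pd.DataFrame({
    "os_version":   ["4.1", "4.1", "4.1", "4.1", "4.0", "4.0"],
    "manufacturer": ["Samsung", "LG", "LG", "Sony", "Sony", "LG"],
    "device_age":   [1, 1, 1, 2, 2, 3],
    "region":       ["Asia", "Asia", "Europe", "Asia", "NA", "Europe"],
    "has_error":    [1, 1, 0, 0, 0, 0],   # the anomaly label
})

def rule_stats(df, label, **conditions):
    """Precision and support of an equality rule such as os_version == '4.1'."""
    mask = pd.Series(True, index=df.index)
    for col, val in conditions.items():
        mask &= df[col] == val
    matched = df[mask]
    return matched[label].mean(), len(matched)   # (precision, examples returned)

print(rule_stats(df, "has_error", os_version="4.1"))                # (0.5, 4)
print(rule_stats(df, "has_error", manufacturer="Samsung"))          # (1.0, 1)
print(rule_stats(df, "has_error", os_version="4.1", device_age=1))  # (0.666..., 3)
```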
And when you start adding way more features in your decision rule, that’s when it’s harder for a human to really understand the interactions there and come up with a story about what’s going on. You can look at similar methods to this one, like decision trees, for example, especially ones that are quite deep, or random forests, which are kind of an amalgamation of many decision trees. And even linear models: they’re hard to understand because they don’t easily split up or individually package up interactions among just one or two features. They kind of have all the features combined into one big thing that you have to parse or understand yourself. That’s the advantage here. So just as an aside now, as I was saying, we can extend decision rules to continuous features, which have a continuous range of values instead of a fixed number of distinct values.
And the way we do this is we find the min and max of the continuous feature range, and we just discretize that range into 15 equally sized buckets; that’s what we do in our system. We chose the value 15 because it works well in practice, and it also helps us with the hardware acceleration that I’ll talk about briefly in a bit. So you turn your continuous range into 15 different buckets. And there are definitely other discretization schemes you can do that are not just this kind of linear bucketing, but this is what we’ve done for this work, and adding different schemes is definitely a possible future direction for us.
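As a rough sketch of this equal-width bucketing (the system’s real discretization lives in its C++ backend; `NUM_BUCKETS` and `discretize` are illustrative names of my own):

```python
import pandas as pd

NUM_BUCKETS = 15  # the fixed bucket count the talk settles on

def discretize(col: pd.Series) -> pd.Series:
    """Equal-width bucketing of a continuous feature into bucket ids 0..14."""
    # pd.cut with labels=False returns the integer bin index for each row,
    # using equal-width bins between the column's min and max.
    return pd.cut(col, bins=NUM_BUCKETS, labels=False)
```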
I just talked about single decision rules. And to get good performance, as you might expect, we need to combine these rules, because a single decision rule is unlikely to be enough to classify across your whole range of examples. In the small example I showed, you might have a rule like manufacturer is equal to Samsung. And yes, it has pretty high precision, but it doesn’t cover your whole range of anomalies. So what we want to do is create a classifier with many good decision rules that kind of cover the entire space of anomalies. And then if any of them fires on an example, if any of these decision rules fires or ends up being true, then we say that an anomaly is detected. More formally, we’re taking the logical OR of all of the rules that we add to our classifier.
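In pandas terms, the combined classifier is just an OR over per-rule masks. A sketch, assuming each rule is represented as a `{column: value}` dict as in the earlier snippet:

```python
import pandas as pd

def predict(df: pd.DataFrame, rules: list) -> pd.Series:
    """Return 1 for rows where ANY equality rule matches (logical OR of rules)."""
    fired = pd.Series(False, index=df.index)
    for rule in rules:
        mask = pd.Series(True, index=df.index)
        for col, val in rule.items():   # all conditions within a rule must hold
            mask &= df[col] == val
        fired |= mask                   # any firing rule flags the row
    return fired.astype(int)

# e.g. predict(df, [{"manufacturer": "Samsung"},
#                   {"os_version": "4.1", "device_age": 1}])
```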
And this is still interpretable. We said that a single decision rule is of course very interpretable. It’s still interpretable when you have a classifier with many decision rules, because a human can see which rules fired for a particular example. And they can analyze what those rules are doing, because each rule is pretty simple. And what I’ll talk about a little bit later is also that we try to keep a fairly small number of overall rules in our classifier. We try to prune out kind of redundant rules so that you can look at all the rules in your classifier if you’d like, and understand them, and it’s not an overwhelming number of them. So how do we evaluate decision rules and choose good ones? Because we said we want to pick just the good ones to put into our classifier. Well, we use a bunch of different counts that I’m going to talk about here.
We’re going to maintain a count of anomalies and total examples classified for all possible one or two feature decision rules of the form that I was showing before, so equality decision rules. So what is our table going to look like with all these counts? Well, you’re going to have some entries in your table that only involve a single feature. So that’s like the first row in this table, where the feature is OS version, the value is 4.1, and there’s no second feature. And then, if you go back to our dataset, you can see, okay, that one gives two actual anomalies and returns four total examples, as I was showing earlier, giving us a precision of 0.5. And now we can look at, okay, for OS version is 4.1, what are the two feature rules involving that one?
So then we can bring in region, for example, region is equal to Asia, and that has the number of anomalies and total examples returned that you see there, and the precision, and so on and so forth. So you can imagine that this table ends up getting quite large, because especially for the pairs of features, you have kind of the cross product of all features, and then all values within each pair of features as well. So we’re computing all of these counts, and then we’re going to use this information to select good rules. So this table ends up getting quite big. That’s what I was alluding to before: when you have these two feature counts, the computation ends up getting quite expensive to compute this large table. The way we’re able to make this approach practical is we’ve got a fast C++ implementation that’s fully parallelized to compute all these counts. And we also have experimental GPU and FPGA acceleration available. You can see this all at the GitHub that I’ll link in a little bit.
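Here is a plain-pandas sketch of building that count table; the real system does this in parallel C++, and `count_table` is an illustrative name, not the library’s API:

```python
from itertools import combinations
import pandas as pd

def count_table(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Anomaly/total counts for every one- and two-feature equality rule.

    Assumes all features are categorical (continuous ones already bucketed).
    """
    feats = [c for c in df.columns if c != label]
    rows = []
    for f in feats:  # single-feature rules
        g = df.groupby(f)[label].agg(anomalies="sum", total="count")
        for val, r in g.iterrows():
            rows.append((f, val, None, None, r.anomalies, r.total))
    for f1, f2 in combinations(feats, 2):  # cross product of feature pairs
        g = df.groupby([f1, f2])[label].agg(anomalies="sum", total="count")
        for (v1, v2), r in g.iterrows():
            rows.append((f1, v1, f2, v2, r.anomalies, r.total))
    out = pd.DataFrame(rows, columns=["feat1", "val1", "feat2", "val2",
                                      "anomalies", "total"])
    out["precision"] = out["anomalies"] / out["total"]
    return out
```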
In practice, we can scale to the thousand feature range, even for fairly large datasets. So you can generate new features from the existing data that you have, if you want to scale up a little bit and see if you can come up with better decision rules. And you should be able to work with fairly large data with this method. So how do we actually select decision rules based on the table that I showed before? Well, we’re going to filter on the precision and the total count of examples returned by each decision rule. In particular, we’re going to create a classifier with all decision rules having precision greater than some user-specified P thresh, and total examples returned by the rule greater than some user-specified C thresh.
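Rule selection is then just a filter over that table; a sketch reusing the hypothetical `count_table` from the previous snippet:

```python
# User-specified thresholds (the values used later in the demo).
P_THRESH, C_THRESH = 0.7, 10

table = count_table(df, "has_error")
# Keep rules with precision > P thresh and total examples > C thresh.
selected = table[(table["precision"] > P_THRESH) & (table["total"] > C_THRESH)]
```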
That’s going to give us a classifier with a large number of rules, and it’s going to operate the way that I mentioned earlier. But as I was kind of alluding to as well, you likely want to prune decision rules from your classifier. Because with this scheme, you might have fairly tight thresholds, but it’s still probably going to return a fair number of rules. And in fact, the overall classifier is likely to have lower than P thresh precision. Remember, P thresh is the threshold to select individual rules, and you’re combining all of them together. The overall classifier with the combination of all those rules is likely to have lower than that precision, because there’ll be more overlap in the selected rules’ actual true anomalies than in the false positives that they return. So if you think about that, you can see that the net precision goes down. This isn’t always true, but it’s often the case.
So what we want to do is prune redundant rules that are returning similar sets of anomalies. We don’t need both such rules, because they can return different sets of false positives and drag down precision. And furthermore, by pruning these kind of overlapping and redundant rules, we’re also going to have fewer overall rules in our classifier, and that’s going to make it easier for the user to actually look at their whole classifier, all the rules, and comprehend them. So that’s something we want to achieve. There are many ways to do this pruning, and one heuristic that we’ve implemented in this work is: sort all the rules that you’ve selected for your classifier according to the filter from before, descending by the total number of examples that they retrieved or classified. And then you iterate through these rules and you compute the incremental precision of each rule.
That is, the number of new anomalies it classifies beyond all the previous rules that were above it in the iteration, divided by the number of total new examples that it classifies compared to the previous rules. And we’re going to discard rules that have incremental precision less than some user-specified P thresh prime, or incremental examples classified less than some C thresh prime. You can kind of see how this is discarding rules that don’t really do much compared to the previous rules that we’ve looked at. In particular with this heuristic, when you set the P thresh prime to be very small, a little bit larger than zero, like 10 to the minus two or something like that, and you set the C thresh prime to be one, all this does is eliminate rules that have no correct incremental classifications on the training set.
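As a sketch of that heuristic in plain pandas (this is one reading of the spoken description, with illustrative names; the repo’s implementation is authoritative, including whether coverage counts only kept rules, as assumed here):

```python
import pandas as pd

def equality_mask(df, conditions):
    """Boolean mask of rows matched by an equality rule {column: value, ...}."""
    mask = pd.Series(True, index=df.index)
    for col, val in conditions.items():
        mask &= df[col] == val
    return mask

def prune(rules, df, label, p_prime=0.01, c_prime=1):
    """Drop rules whose incremental contribution is (almost) all false positives."""
    masks = sorted(((r, equality_mask(df, r)) for r in rules),
                   key=lambda rm: rm[1].sum(), reverse=True)  # largest support first
    covered = pd.Series(False, index=df.index)
    kept = []
    for rule, mask in masks:
        new = mask & ~covered                     # rows no earlier kept rule matched
        new_total = int(new.sum())
        inc_prec = df.loc[new, label].sum() / new_total if new_total else 0.0
        if new_total < c_prime or inc_prec < p_prime:
            continue                              # adds no (or almost no) new anomalies
        kept.append(rule)
        covered |= mask                           # assume only kept rules add coverage
    return kept
```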
So this basically means that the only rules you’re going to throw out are ones that don’t classify any new true anomalies at all beyond the prior rules. As you could infer from that, this strictly improves the classifier’s precision on the training set with no change to recall. You’re still classifying all the true anomalies that you would have on the training set, but you’re eliminating rules that may have just brought in extra false positives. Other heuristics beyond this kind of sorting scheme to prune decision rules are possible, but this is the one we’ve implemented in this work. Now finally, I’m going to give you a quick summary of the pandas API before getting into the demo. First, we have a method called compute sums that basically computes that entire table I was showing before on the training set. You also have to give it the name of your class variable.
It returns this sums object S, and then with S you can get rules according to the P thresh and C thresh that you want, and that’s going to return a data frame of all the rules that pass those thresholds on the training set. And then we have a prune method, like I was talking about, where you can give it your existing list of rules, and a set of examples to use when determining incremental performance, which generally you want to be the training set, and then you give it the P thresh prime, C thresh prime, and so on. You can display the rules in an easily human-readable format. And you can also evaluate your set of rules, which is essentially a classifier, on some test set, to return the precision and the recall on the anomaly class for that classifier. That’s our API summary.
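Putting that summary together, end-to-end usage looks roughly like the following. The method names, call shapes, and column name here are reconstructed from the spoken descriptions (compute sums, get rules, prune, display, evaluate), so consult the repo’s README at https://github.com/jjthomas/rule_engine for the exact API:

```python
import pandas as pd
import rule_engine as re  # the package from github.com/jjthomas/rule_engine

df = pd.read_csv("train.csv")                # hypothetical training data
s = re.compute_sums(df, "has_error")         # counts for all 1/2-feature rules
rules = s.get_rules(0.7, 10)                 # rules passing P thresh / C thresh
pruned = s.prune(rules, df, 0.01, 1)         # drop redundant rules on the train set
s.display(pruned)                            # human-readable rule listing
precision, recall = s.evaluate(pruned, df)   # score the classifier on a dataset
```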
Now I’d like to get into a demo, and you can see the GitHub link here for this tool, so you can take a look and try it out yourself, and I’ll show you how to do it in the demo. And you can also get in touch with me at this email address. All right. So let’s take a look at a demo of our rule engine system. We’re assuming here that you’ve gotten it already from GitHub and you’ve installed it. If you’re on macOS like I am here, you’ve installed the LLVM dependency with brew, and then you’ve just pip installed the system. We’re going to get into the virtual environment that we have here, where we’ve installed everything. There we go. We’re inside of our virtual environment. And now we’re going to get into the examples directory, and I’m going to show you the rookies example. Let’s just take a look at that example first.
Here is the code for that, and I’m going to walk through it step by step in an interpreter. So let’s get a new interpreter here. First, we want to import our rule engine as RE and then import pandas as PD. And so we’ve got this dataset about basically NBA rookies. So essentially, for all players that have ever been in the NBA until 2017, we’ve got their rookie stats: 30 or so different statistics on them. And then the variable we’re trying to predict is if they ever become an All-Star at any point in their career. Only 10% of NBA players ever become All-Stars, so it’s sort of an anomaly detection problem, in a sense. Let’s read that rookies CSV first. Let’s print this thing; you can see it’s got around 4,000 rows of different players, and you can see it’s got several different stats, and we’ve got this if all-star column at the end.
Now let’s compute all the sums over this data as I was showing before. So we want to do RE dot compute sums of DF and the if all-star column. Okay, so this computes the sums, and it also prints out a summary of all the different columns in the dataset. So it doesn’t use the player column, for example; that makes sense, because you don’t want to use the player name to predict anything. So it sees that the cardinality of this column’s too high and discards it. And then it shows you, okay, was the variable continuous? And if so, what was the range used to discretize it, as I said before, or was it categorical? So you can see we’ve got a bunch of different statistics, like your standard win shares, field goals, so on and so forth. And it also shows the overall mean of the class variable. As I was saying, it’s about 10% of players that have a one in the if all-star column; they became All-Stars.
Okay, so that is the first step. And now we want to just get the rules that meet a particular threshold. So let’s get rules that have at least a precision of 0.7, and that classify at least 10 distinct examples in our training set. So these are one and two feature rules. Okay. So we’ve got the rules sitting right here, and this is also a data frame. You can see this isn’t super interpretable, and that’s why we’ve got this display method to print out the rules. But you can see it’s similar to that table I showed in the presentation, where it’s got the first feature, the value of that feature, the second feature if there is a second feature in the rule (if there’s not, it’s going to be a minus one), the second feature’s value, the total count of examples that are classified by this rule, and the positive fraction, which is the same as the precision.
And this is sorted by the total number of examples classified. This is a fair number of rules, so let’s see how we did. If we do evaluate summary, and let’s just see what our API is there, we want to pass in our data frame. So we’re just going to evaluate this on the training set itself; you could have a separate test set if you want. So you can see that this gets about 56% precision overall, which, as I said earlier, is likely to be the case: when you amalgamate many rules, the overall precision of the classifier goes below the precision of individual rules. So here the overall precision is 56%, even though the precision of individual rules is at least 70%. And then the recall’s about 60%; about 60% of the actual All-Stars are recalled here.
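The interpreter session up to this point looks roughly like this, reconstructed from the spoken demo with the same approximated names as above (the commented numbers are the rough values quoted in the talk):

```python
import pandas as pd
import rule_engine as re

df = pd.read_csv("rookies.csv")          # ~4,000 players, if all-star label column
s = re.compute_sums(df, "if_all_star")   # prints per-column summary; label mean ~0.10
rules = s.get_rules(0.7, 10)             # precision >= 0.7, >= 10 examples classified
s.evaluate(rules, df)                    # roughly 0.56 precision, 0.60 recall
```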
This is pretty good performance, because this is kind of a hard problem. But let’s see if we can prune some of the rules that are giving us only false positives and not really adding any new true classifications to our results on the training set. We can do this pruning here with rules dot prune, passing the rules and a data frame… let me just double check that. Yep. And we’re going to use a small value of P thresh prime, and a value of one for the count, to do exactly what I was saying in the presentation: just prune out only stuff that gives us false positives and no new anomalies at all. So this is pretty safe to do. We’re generally not going to prune any potentially good rules.
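And the pruning step, again with approximated call shapes (the values mirror the presentation’s “safe” setting):

```python
pruned = s.prune(rules, df, 0.01, 1)   # P thresh prime ~ 0, C thresh prime = 1
s.evaluate(pruned, df)                 # same ~60% recall, improved precision
s.display(pruned)                      # ~60 rules, printed human-readably
```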
Okay. Now you can see we’ve only got 60 rules, which is kind of nice. We’ve got 60 rules, a smaller number, and we’re going to now evaluate summary on the DF. And you can see that we’ve got the exact same recall here. We’re still recalling about 60% of the true All-Stars, but we have better precision, because we’ve removed rules that really weren’t adding any extra value. And now finally, what we can do is display rules. So we can get kind of a human-readable output here. You can see all the different rules. And again, it’s sorted by the total number of examples classified. You can see the heavy hitter rules, the ones classifying a lot of stuff correctly, at the top. You can see, okay, if the games played was from 77 to 83, and the player efficiency rating was in this range, then actually, 72% of the time, a player with those rookie stats went on to be an All-Star.
And you can see a bunch of other rules. There are only 60 of them, so it’s pretty easy to just go in and analyze what’s going on. And you can see anything with a comma here has two features in the rule, and then there are some, like this one, that only have a single feature involved in the rule. So that’s that, and hopefully you try this out and find it useful. Thanks a lot. All right. Thanks again, everyone, for listening to my presentation. And once again, you can take a look at the repo here at this GitHub link. And if you have any issues at all, please send me an email at this address. Thanks again.

James Thomas

James is a fifth-year PhD student at Stanford University working with Matei Zaharia and Pat Hanrahan. His research interests are in hardware acceleration for data applications. His research has appear...