The Shapley algorithm is a model-interpretation method that is well recognized by both industry and academia. However, given its exponential runtime complexity, existing implementations take a very long time to generate feature contributions for even a single instance, so the algorithm has found limited practical use in industry. In order to explain model predictions at scale, we implemented the Shapley IME algorithm in Spark. To our knowledge, this is the first Spark implementation of the Shapley algorithm that scales to large datasets and can work with most ML model objects.
– Hi, and welcome to the Spark AI Summit. Today we are going to be talking about our open source package Shparkley, which scales Shapley values with Apache Spark. I am Cristine Dewar, and I'm here with Xiang from Affirm. Our agenda today: we'll go through an introduction, talk about what a Shapley value is, how we implemented it, and how the algorithm performs. I am an applied machine learning scientist on Affirm's ML team for fraud. – And my name is Xiang. I'm also an applied machine learning scientist at Affirm, on the underwriting team. – Affirm offers point-of-sale loans for our customers. Our applied machine learning team creates models for credit risk decisioning and fraud decisioning, and also builds recommendation systems to personalize customers' experiences. For both fraud and credit, it is extremely important to have a model that is both fair and interpretable. For credit, we need to be able to explain why applicants were rejected, for fair lending reasons. For fraud, we need our internal risk operations team to understand why users are getting identity declines and what factors are enabling fraudulent users to slip through our fraud models. We want to understand how each feature is impacting each user, not just general feature importance. We have millions of rows of data and hundreds of features, so we need a solution that allows us to interpret how our models are impacting individual users at scale and serves up results quickly. Some might ask why we don't just use an easily interpretable model, like logistic regression. Simply put, we have found that logistic regression does not perform as well as some of the black box models.
So, we need a solution that allows us to interpret the effect of features on individual users, and does so in a timely manner. Let's focus on that first bullet point. In cooperative game theory, there is a problem of how to allocate the surplus resources generated by the cooperation of players. This is the case when the whole is greater than the sum of its parts: how do we divide that surplus? The way I like to think about it is like a trivia game. The number of unique questions a team gets right is higher when the players play together than if each player played separately and they then combined their answers. By working together and collaborating, players are able to get questions right that they would not have been able to get right on their own. We want to attribute not only how much an individual would bring to the table working alone, but also how valuable they are as a collaborator.
So, we want the following properties to hold when allocating a marginal contribution to each player. Let's go into each one of these in detail: symmetry, dummy, additivity, and efficiency.
So, symmetry. Pretty simple: two players who contribute equally are paid out equally. If you go out to eat with your friends and you eat the same amount, then when splitting the bill, you should pay the same amount.
So, dummy. A player who does not contribute has a value of zero. If you go out to dinner with your friends and just have water, you should not have to pay anything.
So, additivity. This one is a little more complicated: a player's marginal contribution for each game, when averaged, is the same as evaluating that player over the entire season. For a random forest model with three trees, if you evaluate a feature's importance for each tree and then average those values, the result should equal that feature's importance evaluated on the whole model.
So, efficiency. The marginal contributions of all features, when summed with the average prediction, give that sample's prediction. So if the average prediction is 0.5 and we have four different features, then the marginal contributions of each one, added to that average prediction, should give that sample's prediction.
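To make the efficiency property concrete, here is a tiny Python sketch with made-up contribution values (the numbers are ours, purely for illustration, not from any real model):

```python
# Efficiency property: average prediction + sum of per-feature marginal
# contributions = the model's prediction for that sample.
average_prediction = 0.5

# Hypothetical marginal contributions for the four features in this
# talk's running example (invented numbers, for illustration only).
contributions = {
    "fico_score": 0.15,
    "num_delinquencies": -0.05,
    "loan_amount": 0.02,
    "repaid_affirm_before": 0.08,
}

instance_prediction = average_prediction + sum(contributions.values())
print(instance_prediction)  # ≈ 0.70
```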
So, these are the four properties that we think are important when evaluating marginal contribution: symmetry, dummy, additivity, and efficiency. So what is the solution that has all four of these properties?
Shapley values. So, what is a Shapley value?
So a Shapley value is a way to define payments proportional to each player’s marginal contribution for all players in the group. It has all those properties we just talked about, and this concept was introduced by Lloyd Shapley in 1953. So, let’s jump into an example where we can see how the math behind the algorithm works.
So, we're going to look at this four-feature model example. Earlier we were using the term "player" from game theory; in ML models, our players are our features, and the goal they are trying to achieve is making a correct prediction. For this example, we have a model with four features: the credit score; the number of delinquencies, that is, how many times they made a payment late, not just to Affirm but to all lenders; how much they are asking to borrow; and whether they have repaid Affirm specifically before. So now let's look at the formula to see how we'll calculate the marginal contribution.
So what we're trying to get is the marginal contribution of feature j, so we are going to sum over the possible permutation orders for feature j. To do this, we need the place in the permutation order and the number of features; together these give the fraction of permutations with feature j in that place in the permutation order. You then multiply that fraction by the difference between the score with feature j and the score just prior to adding feature j.
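The formula being described is usually written as follows, with F the full feature set and v(S) the model's score when only the features in S are present:

```latex
% Shapley value of feature j. The factorial fraction is exactly the
% fraction of permutation orders in which the features in S come
% before j, with j immediately after them.
\phi_j \;=\; \sum_{S \subseteq F \setminus \{j\}}
    \frac{|S|!\,\left(|F| - |S| - 1\right)!}{|F|!}
    \left( v\!\left(S \cup \{j\}\right) - v(S) \right)
```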
So why does permutation order matter? Why are we summing over the permutations? We're not only trying to see how well this feature works alone; we are trying to measure how well it collaborates with every other feature or feature set. So in order to do this, we need to manufacture all possible scenarios of how a feature can interact with another feature or feature set. How can we do that? Permutation orders matter.
So, what we have here is our feature of interest, FICO score. It can only be in four possible positions in the feature order.
First, second, third, or fourth. For each of these positions, all we care about is the set of features that comes before our feature of interest. When our feature of interest is in the first position, there are no features before it, so there is only one possible preceding set. When it is last, all the other features are before it, which is again only one option. And when it is second or third, there are three possible preceding sets. For each of these permutations, we need to compare how the model performs with and without our feature of interest. So, let's get back to our equation.
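This counting of preceding feature sets can be checked with a short Python snippet (the feature names are just the ones from the talk's running example):

```python
from itertools import combinations

# For a feature of interest in position k (1-indexed) of a 4-feature
# permutation, exactly k-1 of the remaining 3 features precede it.
features = ["fico_score", "num_delinquencies", "loan_amount", "repaid_affirm_before"]
feature_of_interest = "fico_score"
others = [f for f in features if f != feature_of_interest]

for position in range(1, len(features) + 1):
    preceding_sets = list(combinations(others, position - 1))
    # Positions 1 and 4 each allow one preceding set; positions 2 and 3
    # each allow three.
    print(position, len(preceding_sets))
```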
So, now we know what we are summing over: we are going through each of these different permutation orders. We know our fraction, because we know our place in the permutation order and our number of features. So now the question is, how do we get the score with feature j and the score without it? We are comparing the performance of each of these permutations when it has the feature of interest and when it does not.
And what we’ll do is for every possible permutation order, we are comparing these two things.
So, now we get the math behind this. We see what we're looping through, and we see that we are comparing the two scores with and without our feature of interest. But how do we actually get the model's score with some features and not others, in an efficient manner? We could retrain the model every time, but that is very slow and misses some interactions. So instead, we can approximate.
So what we're going to do is approximate by suppressing the permuted-out features' contributions, turning them into noise. The way I like to think about it is by comparison to creating a PDP plot: you hold some values constant while changing others to see how the output changes. Xiang will go into more detail about this during the implementation section.
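A rough sketch of this suppression idea (not Shparkley's actual code; the function and variable names are ours): to score a model "without" certain features, keep the coalition's features from the instance and fill the rest in from a background row.

```python
def value_of_coalition(model, instance, background_row, coalition):
    """Score a feature coalition: coalition features keep the instance's
    values, all other features are 'suppressed' by substituting values
    from a background row, much like holding axes fixed in a PDP plot."""
    x = {
        feature: instance[feature] if feature in coalition else background_row[feature]
        for feature in instance
    }
    return model(x)

# Toy usage with a made-up additive scoring function.
toy_model = lambda x: x["fico_score"] / 850.0 + 0.1 * x["repaid_before"]
joe = {"fico_score": 660, "repaid_before": 1}
background = {"fico_score": 0, "repaid_before": 0}
score_with_fico_only = value_of_coalition(toy_model, joe, background, {"fico_score"})
```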
So, sounds good. We have a way to get the marginal contributions for individual rows, not just generalized feature importance. But even with approximation, it seems super computationally expensive, so how do we deal with that? To answer that, I'm going to say "Apache Spark" and pass it along to Xiang to explain how we actually implement it. – Okay. Now let's see how this is implemented in Spark. I'll first go over a Monte Carlo based approximation algorithm for the Shapley values, and then I'll go into the details of how it is implemented using Spark.
So first I'll talk about this Monte Carlo based approximation algorithm. Let's say we have an applicant called Joe, and Joe is applying for a $500 loan at Affirm. Joe's FICO score is 660, he has repaid with Affirm before, and he has two delinquencies on his credit report. At Affirm, we use a black box model to underwrite Joe: the model consumes all of these features and outputs a repayment prediction. Now, let's say that we want to understand the Shapley value for Joe's FICO score, which is 660.
So what we do first is sample a random permutation order, as well as a random instance from our dataset. Here the sampled order is loan amount first, followed by FICO score, then number of delinquencies, and lastly whether the applicant has repaid with Affirm before. This one random order stands in for all of the possible permutations that you saw in the earlier formula. As for the random instance, let's call him John. John has a FICO score of 700, he applied for a $300 loan, he has repaid with Affirm before, and he has zero delinquencies on his credit report.
Then, using this order and John's data, we create two counterfactual instances. For the first counterfactual instance, we take all of the features up to and including FICO score in this order from Joe, which is the $500 loan amount and the 660 FICO score. Then we append all the remaining feature values from John, which is zero delinquencies and has repaid with Affirm before. That gives us our first instance. The second instance is very similar to the first, except that we replace the FICO score with John's FICO score, which is 700. So what do these two instances represent? The first instance carries the effect of Joe's FICO score, whereas the second one does not. We then feed these two instances into the black box model and observe the difference in their outputs. That difference is one approximation of the marginal contribution of the FICO score. We run this procedure many, many times, and by the law of large numbers, we expect the average of these differences to be very close to the actual Shapley value for Joe's FICO score.
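The procedure just described can be sketched in a few lines of Python (an IME-style approximation; the function names and structure are ours, not Shparkley's API):

```python
import random

def one_permutation_draw(model, instance, reference, feature):
    """One Monte Carlo sample of `feature`'s marginal contribution."""
    order = list(instance)      # feature names
    random.shuffle(order)       # random permutation order
    idx = order.index(feature)

    # Counterfactual WITH the feature: everything up to and including
    # `feature` comes from the instance (Joe), the rest from the
    # reference (John).
    with_f = {f: instance[f] if i <= idx else reference[f]
              for i, f in enumerate(order)}
    # Counterfactual WITHOUT: `feature` itself also comes from the reference.
    without_f = {f: instance[f] if i < idx else reference[f]
                 for i, f in enumerate(order)}
    return model(with_f) - model(without_f)

def estimate_shapley(model, instance, references, feature, n_iter=2000, seed=0):
    """Average many draws; by the law of large numbers this converges
    to the Shapley value of `feature` for `instance`."""
    random.seed(seed)
    draws = [one_permutation_draw(model, instance, random.choice(references), feature)
             for _ in range(n_iter)]
    return sum(draws) / len(draws)
```

For a purely additive model every draw gives the same difference, which makes a handy sanity check for this kind of estimator.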
Now let's take a look at how this is implemented in Spark. In Shparkley, we pre-select a large number of samples beforehand; they are used as reference rows and serve the same function as John in our previous example. We split this huge dataset into partitions and broadcast the instance to investigate to each of the partitions.
To illustrate the idea, let's say that we want to calculate the Shapley value for loan amount. We have some rows in a partition, which look like the row on the left, and we also have the instance on the right, which is the instance we are trying to investigate. We sample a random permutation order, which happens to put number of delinquencies in first place, followed by FICO score, then loan amount, and lastly whether or not the applicant has repaid with Affirm. We rearrange both instances according to this order, so that number of delinquencies is in first place, followed by FICO score, then loan amount, and lastly whether or not they have repaid.
Using these two rearranged instances, we can now create the counterfactual instances. For the instance that has the effect of loan amount, we take number of delinquencies, FICO score, and loan amount from our instance to investigate, since these are the features at or before loan amount in this order, and we take everything else from the reference row in the partition, which is "yes" here. For the instance that does not have the effect of loan amount, we take only number of delinquencies and FICO score from our instance to investigate, and replace the loan amount with the reference row's value, which is $1,000 here.
Then we feed these two counterfactual instances into our black box model and observe the outputs. In our example, the output is 0.8 for the instance that has the effect of loan amount, whereas it is 0.7 for the instance that does not. The difference, 0.1, is the marginal contribution from this reference row for the loan amount of the instance we are trying to investigate.
We run this procedure over all the rows across all the partitions. Within each partition, this yields a two-column data frame, where one column holds the marginal contributions from the reference rows and the other holds the feature each contribution belongs to. Then all we need is an average per feature, and we get the Shapley values for all the features.
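The per-partition bookkeeping can be sketched in plain Python (Shparkley does this inside Spark, and this standalone version with names of our choosing only illustrates the aggregation step):

```python
from collections import defaultdict

def average_contributions(partitions):
    """Each partition yields (feature, marginal_contribution) pairs;
    averaging per feature gives the Shapley value estimates."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for partition in partitions:
        for feature, contribution in partition:
            totals[feature] += contribution
            counts[feature] += 1
    return {feature: totals[feature] / counts[feature] for feature in totals}

# Two toy "partitions" of (feature, contribution) pairs.
partitions = [
    [("loan_amount", 0.10), ("fico_score", 0.30)],
    [("loan_amount", 0.30), ("fico_score", 0.10)],
]
shapley_estimates = average_contributions(partitions)
```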
Before I go into the runtime and the Monte Carlo error convergence, I want to list a few highlights of our Shparkley implementation. The first is that this is a Spark-based implementation that scales with the dataset. Since we pre-select the background or reference rows beforehand, the runtime depends only on the number of background samples, and it is horizontally scalable: you can simply add more machines or more workers.
The second is that our implementation can leverage the runtime advantages of batch prediction in your inference model: if your model has any kind of optimization, such as vectorization, that speeds up batch prediction, our algorithm benefits from it. The third is that our implementation reuses the predictions calculated along each permutation order to compute the Shapley values for all of the features at once. We have open sourced our implementation; you can find a link at the end of the slides, and if you're curious about how we do it, you can find the details there. And last but not least, we have also added support for weighted Shapley value calculation, so if your data instances carry importance weights, our implementation can account for that. Okay, now let's take a look at how the Monte Carlo error converges with the number of samples we pre-select. Here we show how the overall Shapley value difference and the rank difference change as we increase the number of samples in our background dataset. As we can see, the error shrinks quickly and is nearly at its minimum once we have over 100 thousand samples, so we'll use that for the runtime comparisons next.
So, using 100 thousand reference samples and a dummy model, we ran our implementation, Shparkley, against the open source package shap, which was created by the author of the SHAP paper. On a cluster, our implementation runs for 18 seconds to generate the Shapley value explanation for an instance, whereas the BruteForce Explainer from the shap package took around 20 minutes, which is around a 50 to 60 times runtime improvement. The cluster configuration we used was 10 machines, one master and nine workers, each an Amazon r5.4xlarge EC2 instance with 16 cores. Given that the shap package is not parallelized, if you normalize by the number of cores there is still a five to six times improvement in runtime. So how do our implementation's Shapley values compare against the brute force explanation, which can be regarded as ground truth? Here we can see on the right that the differences in the Shapley values are quite small, and there is no ranking difference in terms of the Shapley values.
So, in conclusion, we have an implementation that, compared to a brute force implementation, gives a 50 to 60 times runtime improvement, produces nearly identical Shapley values, and has zero difference in rank.
This implementation is a joint effort by the following people here at Affirm, and it is available on Affirm's GitHub. You can pip install the Shparkley package and start integrating your black box model. That's all, thanks for attending our presentation.
Xiang is a data scientist on the Data Science algorithms team at Affirm, working on production models to make real-time underwriting decisions as well as model interpretability. He is also interested in topics like distributed machine learning and machine learning serving.
Cristine Dewar is a data scientist on the Data Science fraud team at Affirm. She is currently working on models to prevent fraud. Cristine is passionate about fair and explainable ML and using data science to improve lives.