Black box models are no longer good enough. As machine learning becomes mainstream, users increasingly demand clarity on why a model makes certain predictions. Explaining linear models is easy but they often don’t provide enough accuracy. Non-linear models such as GLM (Generalized Linear Models) and Random Forest provide better accuracy but are hard to explain due to their non-linear nature. In addition to explaining the model predictions for the whole training population, there is a need to explain model predictions for an arbitrary subset of the population chosen by the user. Moreover, once users see how each feature contributes to the model prediction, they want to do a ‘what-if’ analysis to explore how changing the features will affect the model prediction. We have developed a technique for:

- Explaining non-linear models
- Showing non-linear feature contributions for an arbitrary subset of a population
- Providing what-if analysis so users can change feature values and see the effect on the prediction

We have implemented a Spark library so that any GLM or Random Forest model created using Spark ML can be explained with it. In addition, we have created a node.js library so that browser-based applications can calculate model explanations on the fly and let users do what-if analysis on a website. We are currently using this library to explain 50 billion predictions on healthcare data. In this talk, we will cover how this method works and how any Spark user can leverage our library to do the same for any GLM or RF prediction in Spark ML.

– Hi folks. My name is Imran Qureshi, and together with Iman and Alvin, I will be talking today about model explanation and prediction exploration in Spark.

We would love to get any feedback about this session, and so feel free to send that our way.

So I will cover the introduction section: why should you do prediction explanation, what are the current approaches, and why we felt those approaches did not meet our needs. Then I will turn it over to Iman, who’s going to talk about the details of our approach: how we implemented it for GLM and how we implemented it for random forest. And then we will talk about the open source library we’re making available, if you want to use this in your own Spark implementations.

Let’s jump straight in. Before we go into the details, let’s talk a little bit about Clarify Health, so you know who we are. We process over 11 billion healthcare claims today, covering 150 million patients. That’s over 40 terabytes of data. We train a lot of models: that number is actually over 2000, and we have over 2000 features that feed into those models. In terms of our infrastructure, it’s all built on AWS and Spark. Right now we can train 16 models for about 84 million patients in about five hours. In addition, the accuracy of our models is in line with industry standards. For example, we can predict mortality at a better rate than Elixhauser, which is one of the standard models in healthcare. And we can predict the total cost of a patient with about the same accuracy as the healthcare industry.

So, talking about how we got started: we started like everyone else, probably like all of you. We took some features, we threw in a whole bunch of data, we tuned some hyper-parameters, we chose some algorithms, and then we turned the crank and out came predictions. And we were pretty happy. We took them to various doctors and healthcare executives and life science companies, and we said, look at this. What happened is probably what you’ve all experienced: people started questioning our models. Someone would say, how do I know your model is correct? That question comes up quickly. Other people started asking, well, why does Dr. Iman cost twice as much as Dr. Alvin? What is Dr. Iman doing differently?

Other people asked, how much does age versus the diagnosis of the patient contribute to the prediction? Then people started asking what-if questions: what if Dr. Iman reduced his ER visits from two to one, what would happen to the cost of his patients? The question of bias of course comes up, and people said, okay, show me that your model is not biased for race. And by and large, when we took our models to doctors and used them to tell doctors that they needed to do something differently, the reaction was what you would expect. If you’ve never done that, you should feel free to try and see what happens.

So what we quickly learned was that we really needed to explain the model. Just showing a prediction, whether it was total cost or mortality or anything else, was not enough.

So when you start imagining this, what should that explanation look like? What should the exploration look like? We came up with a few requirements. Our first requirement was that we needed to be able to show the contribution of each feature in prediction units. For example, if we said that patients over 65 add $300 to your average cost, then that has to be in dollars to match up with our prediction, which is in dollars in this case. The second requirement was that these feature contributions should add up. For example, if having patients over 65 adds $300 of cost and having a cancer diagnosis adds $400 of cost, then together they should add $700 to the inpatient cost.

Another major requirement was it had to work on any arbitrary sub-population. So we can’t define all the sub-populations of interest ahead of time because in our application, users will go and they will create their own sub-population to analyze. So it had to be something that would be done at runtime, not before that.

And it really needed to work with GLM (generalized linear models) and random forest algorithms. Certainly linear models are really easy to explain, but for the types of data we need to predict, they’re not good enough. And as I mentioned before, we deal with very large volumes, so it had to be something that would scale to thousands of features on many millions of rows.

And it also had to enable the what-if analysis scenario. So for example, people want to ask the question, what if I reduced my ER visits by 10%, what would my total costs become?

And lastly, it had to work in Spark. All our infrastructure is in Spark and we did not want to build another infrastructure to handle that.

So then we looked at the solutions that are available out there. The first common solution is called LIME. If you look at how it matches up with our requirements, it does show the contributions in prediction units, but the feature contributions do not add up in LIME. And it doesn’t really allow you to work on an arbitrary sub-population at runtime. It does work with GLM and random forest algorithms, but it works using a surrogate model, which is close to, but not exactly, what your model is doing. It definitely does not scale: there’s no good implementation in Spark right now, and the other implementations take a very, very long time. It does enable you to do what-if analysis, but as I mentioned before, it does not work in Spark.

The other option that’s available out there is called SHAP. This again does work in prediction units, and the feature contributions do add up, but it’s not designed to do this on an arbitrary sub-population; you have to do your SHAP analysis ahead of time before you can show that. It again uses a surrogate model, which is close to, but not exactly, what the existing model does. And it certainly doesn’t allow you to do what-if analysis.

So let’s take an example of what the end result looks like before Iman walks you through how we built it. In this case, you’re looking at a prediction explanation. If you look at the top right, you can see the predicted average cost for this subpopulation was $14,581. So what actually made up that prediction? Look at the left hand side, at the bottom. In this view we’re showing relative cost compared to the whole training population, so that says zero. Then, in the population that I’m looking at, compared to the whole population, the cost in the past 265 days is higher. On the left hand side we have a bug, of course: it shows percentages instead of dollar amounts. Really it should say that the cost in this subpopulation is $20,800 and the cost in the whole training population was $19,800. So the cost in the past 265 days for this subpopulation is higher, and you can see that this adds $344 to the predicted cost for the next year. Similarly, the next row is about the percentage of patients in this subpopulation that are eligible for Medicare. You can see that it’s a smaller percentage, 19% compared to 22% for the whole population. So that reduces your average inpatient cost by almost $70.
Again, you can go down here and see the contribution of every single feature and how that feature differs from the whole population. The difference shows up on the left hand side; on the right hand side, the bars show the contribution of each of these features. So if you added up all these bars, you would get to $14,581, which is the predicted cost.

So this really allows our customers to understand exactly how the behavior of any sub-population compares to the whole population. It also helps them understand how the differences in features contribute to the difference in the prediction. This gives them visibility into the model. And we found that when we presented this to our customers, they really started to believe our predictions and our model. So at this point, I’m going to turn it over to Iman, who’s going to walk you through how we implemented this for both GLM and random forest, and then talk a little bit about our open source library, which is available if you want to use this.

– Hi, Iman Haji here, and I’m going to walk you through our methodology for explaining random forests and GLMs. Let’s start with GLM explanation.

So, the very simple case of GLM: you see the equation here. The prediction for a GLM is a nonlinearity, denoted by h, applied to the sum of the feature weights multiplied by their values. A very simple case of this is linear regression, where this nonlinearity, h, is actually the identity. In that case, if we call the weight, beta, for each feature multiplied by its value the linear contribution of the feature, the prediction for linear regression is the sum of the linear contributions of the features plus the intercept. So explanation is very straightforward, because the contribution of each feature to the overall outcome is basically its linear contribution, the feature weight multiplied by its value. But things get more complicated once the nonlinearity kicks in. If we have to pass the sum through a nonlinearity such as a sigmoid, then explanation becomes complicated, because we have a bunch of positive and a bunch of negative contributors, and the overall prediction comes out of a nonlinearity. How do we break down this overall output among the contributing features?
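The slide equation is not reproduced in the transcript, but the relationship described here can be written out as follows. The symbols are notation assumed for this note: h for the nonlinearity, β_i for the feature weights, x_i for the feature values, β_0 for the intercept, and c_i for the linear contribution of feature i.

```latex
% General GLM prediction: a nonlinearity applied to the weighted sum
\hat{y} = h\Big(\beta_0 + \sum_{i=1}^{n} \beta_i x_i\Big)

% Linear regression: h is the identity, so contributions are additive
\hat{y} = \beta_0 + \sum_{i=1}^{n} c_i, \qquad c_i = \beta_i x_i

% Logistic regression: h is the sigmoid
h(z) = \frac{1}{1 + e^{-z}}
```

In the identity case the c_i already sum to the prediction; with a sigmoid they do not, which is the allocation problem the speaker goes on to describe.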

So the way we do that is, very briefly: we pool the features into two bins, one for positive contributors and one for negative contributors. And then we linearize their contribution to the overall outcome around the point of the intercept. Now, I’m not going to get into the mathematical details. We’re working on a paper, and hopefully it will be published soon and you can refer to that. What I’m going to do instead is look at a couple of examples and see how our algorithm deals with prediction explanation. On the top, you have the identity link, which corresponds to simple linear regression. On the bottom, we have a sigmoid activation, which results in the popular logistic regression. Both linear regression and logistic regression are special cases of generalized linear models, or GLMs. On the top left, you have the Feature1, Feature2, and Intercept columns. For Feature1 and Feature2, we have the linear contribution, so what is put there is the feature weight multiplied by the feature value. As you can see, for the top table, what our algorithm spits out as feature contributions are simply the linear contributions. So the columns that you see inside the red rectangle mimic what you see in the three columns on the left, because we’re dealing with a linear regression. The rightmost column is the prediction column, which is the sum of the linear contributions, and contribsSum is the sum of the contributions generated by our algorithm inside the red rectangle. Now, things get more complicated if we go to the second table. Again we have the linear contributions on the left, but this time the sum of these linear contributions gets passed through a sigmoid function, because this is a logistic regression. For example, for row number zero, we have Feature1 as 0, Feature2 as 0, and the intercept equal to 1.
If we sum them up, it’s going to be 1; pass it through the sigmoid, and we have .73, as you can see on the right. Now, the algorithm correctly says, all right, Feature1 was 0, so it’s not contributing to the linear sum, so it shouldn’t be contributing to the overall outcome. The same goes for Feature2. And the overall contribution of the intercept is going to be .73, because the sum of these contributions should add up to the prediction. Now, in the second row, which is row number one here, the linear contribution of Feature1 is 1, for Feature2 it’s -1, and the intercept is again 1. So the sum is still 1, because 1 plus -1 plus 1 is still 1; pass it through the sigmoid, and the outcome is still .73. But this time the algorithm correctly identifies that we have two features that are non-zero, one with a positive contribution and one with a negative contribution, and it allocates a contribution for each of them to the overall outcome. So they have .19 and -.19. And again, the intercept contribution is .73, so the total of these three contributions again adds up to .73. You can look at rows two and three by yourself. I would like to compare row number one with row number four. As you can see, the linear contribution of Feature1 is 1 in row number one and 2 in row number four, and for Feature2, again, the linear contribution is doubled. In the red rectangle, you see the contributions determined by our algorithm. As you can see, the contributions are scaled as well. They’re not exactly doubled, because we have a nonlinearity here, but the good news is that the algorithm first identifies negative and positive contributors and preserves the scaling as best as possible, even given the nonlinearity. Let’s move on to random forest, and how we explain random forest.
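The two logistic-regression rows quoted above can be checked numerically. The following is only a small sanity check of the numbers read out in the talk (prediction = sigmoid of the summed linear contributions, and the quoted contributions adding up to that prediction). It does not implement the allocation algorithm itself, whose details the speaker defers to a forthcoming paper. The names `rows` and `check` are made up for this sketch.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, the nonlinearity used by logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))


# Values as quoted in the talk for rows 0 and 1:
# "linear" holds (Feature1, Feature2, Intercept) linear contributions,
# "contribs" holds the contributions the speaker reads off the slide.
rows = {
    0: {"linear": (0.0, 0.0, 1.0), "contribs": (0.0, 0.0, 0.73)},
    1: {"linear": (1.0, -1.0, 1.0), "contribs": (0.19, -0.19, 0.73)},
}


def check(row):
    """Return (model prediction, sum of the quoted contributions)."""
    prediction = sigmoid(sum(row["linear"]))
    contribs_sum = sum(row["contribs"])
    return prediction, contribs_sum


for idx, row in rows.items():
    prediction, contribs_sum = check(row)
    print(f"row {idx}: prediction={prediction:.2f}, contribsSum={contribs_sum:.2f}")
```

Both rows have a linear sum of 1, so both predictions round to .73, and in each row the quoted contributions sum to the same value, which is the additivity requirement stated earlier in the talk.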

So if you want to explain a forest, you’d better know how to explain a tree. So let’s start with a decision tree. In this simple example, we have a model to predict a house price. Let’s say the house price is estimated as $400K times an indicator for the square footage being greater than 1000, plus $300K times an indicator for the bedroom count being greater than two, plus $200K times an indicator for the house being built fairly recently, meaning the age is under 15, plus an intercept of $100K. Now, if we train a regression tree to return these values, it looks something like this. The first branch is square footage, then bedroom count, then house age. We have three binary features, so we have eight possible outcomes, and the tree looks like this. At the time we wanted to use random forest, or tree ensembles in general, and then be able to explain them, we looked around and found a solution out there that does something like this. First, at each node, we calculate the average of the leaves under that node. So for the first node, the root node, where we see square footage, the average value of all leaves is the average of all eight outcomes, and it’s $550K. Then let’s say we go to the upper branch. For that node, we have to average the top four outcomes, which are $1M, $800K, $700K, and $500K. We do that and we get $750K. And again, if we go forward along the upper branch, we have $900K, which is the average of $1M and $800K. Now, the algorithm says that you consider the average at the root as the contribution of the intercept. So $550K is going to be the contribution of the intercept.
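The node averages quoted above can be reproduced with a few lines of Python. This is a minimal sketch that evaluates the toy pricing formula directly instead of training a tree; the function names `price_k` and `node_average` are made up for illustration, and prices are in thousands of dollars.

```python
from itertools import product
from statistics import mean


def price_k(big: int, beds: int, new: int) -> int:
    """Toy house-price model from the talk, in thousands of dollars.

    Each argument is a 0/1 indicator: square footage > 1000,
    bedroom count > 2, house age < 15.
    """
    return 400 * big + 300 * beds + 200 * new + 100


def node_average(fixed: tuple):
    """Average of all leaves below the node reached by fixing the first
    len(fixed) indicators (split order: square footage, bedrooms, age)."""
    free = 3 - len(fixed)
    return mean(price_k(*fixed, *rest) for rest in product((0, 1), repeat=free))


root = node_average(())             # all eight leaves
upper = node_average((1,))          # square footage > 1000
upper_upper = node_average((1, 1))  # ... and bedroom count > 2
print(root, upper, upper_upper)
```

The three values come out as 550, 750 and 900, matching the $550K, $750K and $900K in the walkthrough.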
And now let’s consider the case where we want an estimate for a house that has more than 1000 square feet, more than two bedrooms, and an age under 15. So we’re going to go with the uppermost branch, and the prediction is $1M. We start from the root node, which gives the intercept contribution of $550K. Then if we go up, we have to add another $200K to get to the next average, which is $750K: $750K minus $550K is $200K. So the contribution of the first feature, square footage, will be $200K. Again, we go forward. If the bedroom count is greater than two, $750K becomes $900K, so we have another $150K added. And again, if we go forward, if the house age is under 15, we have another $100K. So let’s put all of this together. On the top, we have the original model that we wanted to emulate using the decision tree, and on the bottom, we see the contributions. There is a problem here: the contribution of square footage was supposed to be $400K, but now we estimate it as $200K. The same goes for bedroom count, house age, and the intercept, which was supposed to be $100K. These numbers don’t match. And that’s where we had to design our own prediction explainer, because what we wanted was a prediction explainer for random forests that matched linear regression and matched our prediction explainer for GLM that I just explained to you. So here’s how we do it in a way that matches linear regression. For the contribution of the intercept, we set all feature values to zero. We call this the exclusion path of the root node, and it’s shown by the red path that you see here. The prediction along this red path is $100K, so we consider that the contribution of the intercept.
So at each node on the graph, we have an exclusion path, meaning the remaining features are set to zero, and an inclusion path, meaning the current feature is set to its actual value for that prediction while the remaining features stay at zero. For example, for the first feature, square footage, the inclusion path is the green path here, and the red path is the exclusion path at that node. The difference between the outcomes of the inclusion and exclusion paths is $500K minus $100K, which is $400K. So that’s going to be the contribution of square footage. Now we go forward in the tree. We fix the square footage at what it was, and for the next feature, we do the inclusion and exclusion paths again. So, bedroom count: is it greater than two? For the prediction we were talking about, it is. So on the inclusion path, that feature is set to one, and the remaining features are zero. The contribution of bedroom count is going to be $800K minus $500K, which is $300K. And we do the same thing for the remaining nodes. The next one is house age: $1M minus $800K is $200K. So that’s the contribution of the last one. If we put all of these together, what do we have? On the top, again, we have the original model. And now on the bottom, we see the contribution of the intercept is $100K, and the contributions line up nicely. So that’s what we wanted: a prediction explainer for decision trees that matches our interpretation using linear regression or GLMs.
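The two attribution schemes from the house-price example can be compared side by side. The sketch below is an illustration on the toy formula only: it evaluates the pricing function directly instead of walking a trained Spark tree, and it is not the library’s implementation. All function and variable names are assumptions made for this example; values are in thousands of dollars.

```python
from itertools import product
from statistics import mean

FEATURES = ("square_footage", "bedroom_count", "house_age")  # tree split order


def price_k(big: int, beds: int, new: int) -> int:
    """Toy house-price model from the talk, in thousands of dollars."""
    return 400 * big + 300 * beds + 200 * new + 100


def averaging_contributions(x: tuple) -> dict:
    """Node-average method: a feature is credited with the change in the
    average of the leaves below the current node as the path moves down."""
    def node_average(fixed):
        free = len(x) - len(fixed)
        return mean(price_k(*fixed, *rest) for rest in product((0, 1), repeat=free))

    out = {"intercept": node_average(())}
    for i, name in enumerate(FEATURES):
        out[name] = node_average(x[: i + 1]) - node_average(x[:i])
    return out


def inclusion_exclusion_contributions(x: tuple) -> dict:
    """Inclusion/exclusion method as described in the talk: features already
    visited keep their actual values, features not yet visited are zero."""
    def path_prediction(upto):
        return price_k(*x[:upto], *([0] * (len(x) - upto)))

    out = {"intercept": path_prediction(0)}  # exclusion path at the root
    for i, name in enumerate(FEATURES):
        # inclusion path (feature i at its actual value) minus exclusion path
        out[name] = path_prediction(i + 1) - path_prediction(i)
    return out


house = (1, 1, 1)  # > 1000 sq ft, > 2 bedrooms, age < 15
print(averaging_contributions(house))
print(inclusion_exclusion_contributions(house))
```

Both dictionaries sum to the $1M prediction, but only the inclusion/exclusion version recovers the original coefficients ($100K, $400K, $300K, $200K); the averaging version yields $550K, $200K, $150K, $100K.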

Alright, so we explained decision trees. And once you can explain a decision tree, any ensemble of trees is explainable, and we’ve implemented it for random forest, with some simplifications to make it faster. These are the benefits of our approach: we have the GLM and random forest explainers. They’re distributed, the implementation is available in Spark, and we have open sourced them. I’m going to show you how you can access them. They’re scalable, and they can be used from both Python and Scala.

So let’s talk about our libraries. Our chief data scientist, Alvin Henrick, has implemented these and has done a great job of optimizing the implementation. The readme files are also available; please refer to them, as there’s a lot of information there to help you use them. But simply put, this is drag-and-drop. You have the transformers implemented. This is the GLM explainer. You have to provide the coefficients that you got from the model you trained. You have your predictions, of course. And you give the prediction explainer transformer the family that you used for your GLM (tweedie, gamma, poisson, or whatever it is), the link power, and the variance power. These are the attributes of the GLM. Then you define whether you want the prediction explanation to come out as a column of nested arrays or as separate columns, plus a bunch of other attributes that you can set. Please refer to the readme file; there’s a detailed explanation of those. At the end of the day, it’s just a transformation. You give it the original data frame, and it creates an additional column, the feature contributions column, and you have your prediction explanation there. The same goes for the random forest explainer. You have to give it what we call the coefficients here; they’re actually the feature importances that you extract from the random forest. You also have to give it the random forest model, so you have to define the model path, because as I said, at each node we take the prediction for the inclusion path minus the exclusion path. We have two predictions at each node, so we need access to the original model. And it does this very quickly. Just to give you an example, say we want to explain hundreds of millions of rows, which we do very often at Clarify. A hundred million rows, let’s say.
If we wanted to use something like Shapley, that would take us months or years. But in this fashion it can be done in 30 minutes to an hour, which is a huge time saving.

And this is the URL for our GitHub repository. Please feel free to download it. We’d love to see your contributions and your comments. The transformers are ready to integrate: just drag and drop them into your projects and pipelines; instructions are in the readme files. Currently we only support generalized linear models and random forests. But as I said, any tree ensemble is within reach; we just haven’t done it yet. These are fairly recent developments. So if you want something, or if you have suggestions for additions, please let us know.

Clarify Health Solutions

Dr. Iman Haji is a Senior Data Scientist at Clarify Health Solutions. He received his PhD in Biomedical Engineering in 2016 from McGill University with the focus of his research on the analysis of medical big data for personalized diagnostics / rehab for neurological disorders. Shortly after his PhD, he co-founded and served as the CEO of the medical diagnostics startup 'Saccade Analytics' to apply his expertise in the field of accurate diagnosis of concussion and dizziness through virtual reality. He is currently a senior data scientist in Clarify Health Solutions analyzing healthcare big data.

As a Chief Data Science Officer, Imran oversees the Data Acquisition, Data Engineering, and Data Science teams (over 25 scientists and engineers). With nearly 12 years of healthcare technology experience, Imran is a med-tech data science leader. Before joining Clarify Health, he was the Chief Software Development Officer at Health Catalyst. As the CSDO, Imran was responsible for the software development in the company, including leading the engineering team building the Data Operating System (DOS). Some of his daily responsibilities at Clarify include working with customers, helping his teammates excel, thinking about complex problems, and learning continuously.

Alvin Henrick is a Staff Data Scientist at Clarify Health Solutions, with expertise in Parallel Database Systems, Interactive Querying, Distributed Query Execution, Query Scheduling, Machine Learning with Spark, and Deep Learning with TensorFlow and Keras. He is an open source committer to Apache Tajo and the Sbt-lighter plugin. Alvin holds a Master’s degree in Computer Science from NIELIT, New Delhi. He has prior experience at VMware, Pivotal, and Humana. Please visit www.alvinhenrick.com for more info.