Model Explanation and Prediction Exploration Using Spark ML

Black-box models are no longer good enough. As machine learning becomes mainstream, users increasingly demand clarity on why a model makes a given prediction. Linear models are easy to explain, but they often are not accurate enough. Models such as GLMs (Generalized Linear Models) and Random Forests provide better accuracy but are harder to explain because the prediction is a non-linear function of the features. Beyond explaining predictions for the whole training population, there is a need to explain predictions for an arbitrary subset of the population chosen by the user. Moreover, once users see how each feature contributes to a prediction, they want to run a ‘what-if’ analysis to explore how changing feature values would affect it. We have developed a technique for:

  1. Explaining non-linear models
  2. Showing non-linear feature contributions for an arbitrary subset of a population
  3. Providing what-if analysis so users can change feature values and see the effect on the prediction

We have implemented a Spark library so that any GLM or Random Forest model created with Spark ML can be explained. In addition, we have created a Node.js library so that browser-based applications can calculate model explanations on the fly and let users run what-if analyses on a website. We currently use this library to explain 50 billion predictions on healthcare data. In this talk, we will cover how the method works and how any Spark user can use our library to do the same for any GLM or RF prediction in Spark ML.
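The three goals listed above can be illustrated with a minimal sketch. It is not the speakers' library or method, and it does not use Spark; it assumes a logistic GLM whose coefficients are already known, and every name and number in it (`contributions`, `subset_baseline`, `what_if`, the features, the coefficients) is hypothetical. Contributions are measured on the link (log-odds) scale relative to the mean of a chosen subset, which is one common way to attribute a GLM prediction to its features.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(intercept, coefs, row):
    """Logistic GLM prediction: inverse link applied to the linear predictor."""
    return sigmoid(intercept + sum(coefs[f] * row[f] for f in coefs))

def subset_baseline(rows, features):
    """Mean feature values over a user-chosen subset of the population."""
    return {f: sum(r[f] for r in rows) / len(rows) for f in features}

def contributions(coefs, row, baseline):
    """Per-feature contribution on the log-odds scale, relative to the baseline."""
    return {f: coefs[f] * (row[f] - baseline[f]) for f in coefs}

def what_if(intercept, coefs, row, changes):
    """Re-score a copy of the row after overriding some feature values."""
    return predict(intercept, coefs, {**row, **changes})

# Hypothetical coefficients and data, for illustration only.
intercept = -2.0
coefs = {"age": 0.03, "prior_admissions": 0.5}
population = [
    {"age": 40, "prior_admissions": 0},
    {"age": 60, "prior_admissions": 2},
    {"age": 80, "prior_admissions": 4},
]

patient = population[2]
baseline = subset_baseline(population, coefs)      # subset = whole population here
contrib = contributions(coefs, patient, baseline)  # how each feature moves the log-odds
p_now = predict(intercept, coefs, patient)
p_changed = what_if(intercept, coefs, patient, {"prior_admissions": 1})

print("contributions (log-odds):", contrib)
print("prediction:", round(p_now, 3), "what-if:", round(p_changed, 3))
```

In this sketch the contributions sum to the difference in log-odds between the patient and the baseline, so passing a different list of rows to `subset_baseline` changes the explanation without retraining anything.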

About Iman Haji

Clarify Health Solutions

Dr. Iman Haji is a Senior Data Scientist at Clarify Health Solutions, where he analyzes healthcare big data. He received his PhD in Biomedical Engineering from McGill University in 2016; his research focused on the analysis of medical big data for personalized diagnostics and rehabilitation for neurological disorders. Shortly after his PhD, he co-founded and served as CEO of the medical diagnostics startup 'Saccade Analytics', applying his expertise to the accurate diagnosis of concussion and dizziness through virtual reality.

About Imran Qureshi

Clarify Health Solutions

As Chief Data Science Officer, Imran oversees the Data Acquisition, Data Engineering, and Data Science teams (over 25 scientists and engineers). With nearly 12 years of healthcare technology experience, Imran is a med-tech data science leader. Before joining Clarify Health, he was the Chief Software Development Officer (CSDO) at Health Catalyst, where he was responsible for software development across the company, including leading the engineering team building the Data Operating System (DOS). His daily responsibilities at Clarify include working with customers, helping his teammates excel, thinking about complex problems, and learning continuously.

About Alvin Henrick

Clarify Health Solutions

Alvin Henrick is a Staff Data Scientist at Clarify Health Solutions, with expertise in parallel database systems, interactive querying, distributed query execution, query scheduling, machine learning with Spark, and deep learning with TensorFlow and Keras. He is an open-source committer to Apache Tajo and the sbt-lighter plugin. Alvin holds a Master's degree in Computer Science from NIELIT, New Delhi. He previously worked at VMware, Pivotal, and Humana.