Iman Haji - Databricks

Iman Haji

Senior Data Scientist, Clarify Health Solutions

Dr. Iman Haji is a Senior Data Scientist at Clarify Health Solutions. He received his PhD in Biomedical Engineering in 2016 from McGill University with the focus of his research on the analysis of medical big data for personalized diagnostics / rehab for neurological disorders. Shortly after his PhD, he co-founded and served as the CEO of the medical diagnostics startup ‘Saccade Analytics’ to apply his expertise in the field of accurate diagnosis of concussion and dizziness through virtual reality. He is currently a senior data scientist in Clarify Health Solutions analyzing healthcare big data.


Model Explanation and Prediction Exploration Using Spark MLSummit 2020

Black box models are no longer good enough. As machine learning becomes mainstream, users increasingly demand clarity on why a model makes certain predictions. Explaining linear models is easy but they often don't provide enough accuracy. Non-linear models such as GLM (Generalized Linear Models) and Random Forest provide better accuracy but are hard to explain due to their non-linear nature. In addition to explaining the model predictions for the whole training population, there is a need to explain model predictions for an arbitrary subset of the population chosen by the user. Moreover, once users see how each feature contributes to the model prediction, they want to do a 'what-if' analysis to explore how changing the features will affect the model prediction. We have developed a technique for:

  1. Explaining non-linear models
  2. Showing non-linear feature contributions for an arbitrary subset of a population
  3. Providing what-if analysis so users can change feature values and see the effect on the prediction

We have implemented a Spark library so any GLM or Random Forest model created using Spark ML can be explained by using our library. In addition, we have created a node.js library so browser-based applications can calculate model explanation on the fly and can allow users to do what-if analysis in a web site. We are currently using this library to explain 50 billion predictions on healthcare data. In this talk, we will cover how this method works and how any Spark user can leverage our library to do the same for any GLM or RF prediction in Spark ML.