Alvin Henrick is a Staff Data Scientist at Clarify Health Solutions, with expertise in parallel database systems, interactive querying, distributed query execution, query scheduling, machine learning with Spark, and deep learning with TensorFlow and Keras. He is an open-source committer to Apache Tajo and the Sbt-lighter plugin. Alvin holds a Master's degree in Computer Science from NIELIT, New Delhi, and has prior experience at VMware, Pivotal, and Humana. Please visit www.alvinhenrick.com for more information.
Black-box models are no longer good enough. As machine learning becomes mainstream, users increasingly demand clarity on why a model makes certain predictions. Linear models are easy to explain, but they often don't provide enough accuracy. Non-linear models such as GLMs (Generalized Linear Models, which are non-linear in the features through their link function) and Random Forests provide better accuracy but are harder to explain. Besides explaining model predictions for the whole training population, there is a need to explain predictions for an arbitrary subset of the population chosen by the user. Moreover, once users see how each feature contributes to a prediction, they want to run a 'what-if' analysis to explore how changing the features would affect that prediction. We have developed a technique for:
We have implemented a Spark library so that any GLM or Random Forest model created with Spark ML can be explained with it. In addition, we have created a Node.js library so that browser-based applications can calculate model explanations on the fly and let users do what-if analysis on a website. We are currently using this library to explain 50 billion predictions on healthcare data. In this talk, we will cover how this method works and how any Spark user can use our library to do the same for any GLM or Random Forest prediction in Spark ML.
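To make the idea of per-feature explanations and what-if analysis concrete, here is a minimal sketch in plain Python. It is not the library described in this talk, whose method is not spelled out here. It assumes a logistic GLM and attributes a prediction to each feature on the link scale, relative to the mean of a user-chosen subset. The feature names, coefficients, and all function names are hypothetical.

```python
import math

def linear_predictor(coef, intercept, x):
    """Link-scale score: intercept plus the weighted sum of feature values."""
    return intercept + sum(coef[k] * x[k] for k in coef)

def predict_prob(coef, intercept, x):
    """Logistic GLM prediction: inverse logit of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(coef, intercept, x)))

def contributions(coef, x, baseline):
    """Per-feature contribution on the link scale, relative to a baseline row."""
    return {k: coef[k] * (x[k] - baseline[k]) for k in coef}

def what_if(coef, intercept, x, changes):
    """Re-score the same row after overriding some feature values."""
    return predict_prob(coef, intercept, {**x, **changes})

# Hypothetical model and data, for illustration only.
coef = {"age": 0.03, "prior_admissions": 0.5, "bmi": 0.02}
intercept = -4.0
patient = {"age": 70, "prior_admissions": 2, "bmi": 30}
subset_mean = {"age": 50, "prior_admissions": 1, "bmi": 27}  # mean of a chosen subset

contrib = contributions(coef, patient, subset_mean)
p_now = predict_prob(coef, intercept, patient)
p_what_if = what_if(coef, intercept, patient, {"prior_admissions": 0})

for name, value in sorted(contrib.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:+.2f}")
print(f"prediction: {p_now:.3f}, what-if: {p_what_if:.3f}")
```

The contributions add up exactly to the difference between the patient's link-scale score and the baseline's, which is what makes this kind of attribution easy to present to users. Tree ensembles such as Random Forests need a different attribution scheme, which this sketch does not cover.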
The number of features going into models is growing at an exponential rate thanks to the power of Spark. So is the number of models each company is creating. The common approach is to throw as many features as possible into a model. Features that don't improve the model can easily end up hurting it by increasing model complexity, reducing accuracy, and making the model hard for users to understand. However, because finding and removing noisy features takes a lot of manual effort, most teams either don't do it or do it sparingly. We have developed an AI-assisted way to identify which features improve the accuracy of a model and by how much. In addition, we present a sorted list of features with an estimate of the accuracy improvement (e.g., in R²) expected from including each one.
Some methods for automated feature selection already exist, but almost all of them are computationally expensive and do not translate to big data applications. In this work, we introduce a fast feature selection algorithm that automatically drops the less relevant input features while preserving, and in some cases enhancing, model accuracy. The method starts with an automated feature relevance ranking based on bootstrapped model training. This ranking determines the order of feature elimination, which is much more efficient than randomized feature elimination. Additional simplifying assumptions, together with our distributed implementation of the process, enable fast, parallelized feature selection on medical big data.
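The two steps named above, a bootstrapped relevance ranking followed by elimination in ranked order, can be sketched on a single machine in plain Python. This is an illustration under stated assumptions, not the algorithm presented in the talk: it assumes an ordinary least-squares model, ranks features by the mean absolute standardized coefficient across bootstrap fits, and drops a feature when validation R² stays within a tolerance. The function names, the tolerance, and the synthetic data are all hypothetical, and nothing here is distributed.

```python
import random

def fit_ols(X, y, ridge=1e-8):
    """Least squares with intercept, solved by Gauss-Jordan on the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented system [A^T A + ridge*I | A^T y]; the tiny ridge guards against singularity.
    A = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        for k in range(p):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]  # [intercept, b1, b2, ...]

def predict(coef, X):
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in X]

def r2(y, yhat):
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    return 1.0 - ss_res / sum((a - mean) ** 2 for a in y)

def select(X, cols):
    return [[r[c] for c in cols] for r in X]

def rank_features(X, y, n_boot=20, seed=0):
    """Feature indices, least relevant first, by mean |standardized coefficient|."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    sds = []
    for j in range(p):
        col = [r[j] for r in X]
        m = sum(col) / n
        sds.append((sum((v - m) ** 2 for v in col) / n) ** 0.5)
    score = [0.0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        coef = fit_ols([X[i] for i in idx], [y[i] for i in idx])
        for j in range(p):
            score[j] += abs(coef[j + 1]) * sds[j] / n_boot
    return sorted(range(p), key=lambda j: score[j])

def eliminate(X_tr, y_tr, X_va, y_va, tol=0.001):
    """Try dropping features in ranked order; keep a drop if validation R2 holds up."""
    def score(cols):
        coef = fit_ols(select(X_tr, cols), y_tr)
        return r2(y_va, predict(coef, select(X_va, cols)))
    kept = list(range(len(X_tr[0])))
    best = score(kept)
    for j in rank_features(X_tr, y_tr):
        if len(kept) == 1:
            break
        trial = [c for c in kept if c != j]
        s = score(trial)
        if s >= best - tol:
            kept, best = trial, max(best, s)
    return kept, best

# Synthetic data: only features 0 and 1 drive the target; 2, 3, and 4 are noise.
rng = random.Random(42)
def make(n):
    X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(n)]
    y = [3 * r[0] - 2 * r[1] + rng.gauss(0, 0.1) for r in X]
    return X, y

X_tr, y_tr = make(300)
X_va, y_va = make(150)
kept, best = eliminate(X_tr, y_tr, X_va, y_va)
print("kept features:", kept, "validation R2:", round(best, 4))
```

Because the ranking fixes the elimination order up front, the loop needs only one model fit per candidate feature, rather than the many fits that trying random subsets would require.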