Alvin Henrick - Databricks

Alvin Henrick

Staff Data Scientist, Clarify Health Solutions

Alvin Henrick is a Staff Data Scientist at Clarify Health Solutions with expertise in parallel database systems, interactive querying, distributed query execution, query scheduling, machine learning with Spark, and deep learning with TensorFlow and Keras. He is an open-source committer to Apache Tajo and the Sbt-lighter plugin. Alvin holds a Master's degree in Computer Science from NIELIT, New Delhi, and previously worked at VMware, Pivotal, and Humana.


AI-Assisted Feature Selection for Big Data Modeling

Summit 2020

The number of features going into models is growing at an exponential rate thanks to the power of Spark, and so is the number of models each company creates. The common approach is to throw as many features as possible into a model. Features that do not improve the model can easily end up hurting it by increasing model complexity, reducing accuracy, and making the model harder for users to understand. However, finding and removing noisy features takes a lot of manual effort, so most teams either skip it or do it sparingly. We have developed an AI-assisted way to identify which features improve the accuracy of a model and by how much. In addition, we present a sorted list of features with an estimate of the accuracy improvement (e.g., R²) expected from including each one.

Existing methods for automated feature selection are almost all computationally expensive and do not translate to big data applications. In this work, we introduce a fast feature selection algorithm that automatically drops the less relevant input features while preserving, and in some cases enhancing, model accuracy. The method starts with an automated feature relevance ranking based on bootstrapped model training. This ranking determines the order of feature elimination, which is much more efficient than randomized feature elimination. Further simplifying assumptions, together with our distributed implementation of the process, enable fast, parallelized feature selection on medical big data.
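
The abstract does not give implementation details, so the following is only a minimal single-machine sketch of the general idea (bootstrapped relevance ranking followed by ranked elimination), not the speakers' algorithm or their distributed Spark implementation. It assumes a plain least-squares model, uses the mean absolute coefficient over bootstrap resamples as the relevance score (which presumes features on comparable scales), and accepts a drop when validation R² falls by no more than a tolerance. All function names, parameters, and the synthetic data are hypothetical.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit(X, y, cols, ridge=1e-6):
    """Least-squares fit (intercept first) on the chosen columns."""
    rows = [[1.0] + [row[c] for c in cols] for row in X]
    d = len(cols) + 1
    A = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d)]
    return solve(A, b)

def r2(X, y, cols, coef):
    """Coefficient of determination of the fitted model on (X, y)."""
    pred = [coef[0] + sum(w * row[c] for w, c in zip(coef[1:], cols)) for row in X]
    mean = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def rank_features(X, y, n_boot=20, seed=0):
    """Return feature indices ordered from least to most relevant.

    Assumption: relevance = mean |coefficient| over bootstrap refits.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    score = [0.0] * d
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        coef = fit([X[i] for i in idx], [y[i] for i in idx], list(range(d)))
        for j in range(d):
            score[j] += abs(coef[j + 1]) / n_boot
    return sorted(range(d), key=lambda j: score[j])

def eliminate(X_tr, y_tr, X_va, y_va, order, tol=0.01):
    """Try dropping features in ranked order; keep a drop if R² loss <= tol."""
    kept = list(range(len(X_tr[0])))
    best = r2(X_va, y_va, kept, fit(X_tr, y_tr, kept))
    report = []  # (feature, estimated R² gain from including it)
    for j in order:
        trial = [c for c in kept if c != j]
        if not trial:
            break
        score = r2(X_va, y_va, trial, fit(X_tr, y_tr, trial))
        report.append((j, best - score))
        if best - score <= tol:
            kept, best = trial, score
    return kept, report

# Synthetic data: features 0 and 1 drive the target, 2 and 3 are pure noise.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [3 * r[0] - 2 * r[1] + rng.gauss(0, 0.1) for r in X]

order = rank_features(X[:150], y[:150])
kept, report = eliminate(X[:150], y[:150], X[150:], y[150:], order)
print("elimination order:", order)
print("kept features:", kept)
for feature, gain in report:
    print(f"feature {feature}: estimated R² gain {gain:.4f}")
```

Because each feature is tested once, in ranked order, the loop needs at most one refit per feature rather than the many refits a randomized search would spend. In a distributed setting the bootstrap refits are independent of one another and could in principle run in parallel, though how the talk's implementation does this is not stated in the abstract.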