AI-Assisted Feature Selection for Big Data Modeling

The number of features going into models is growing at an exponential rate thanks to the power of Spark. So is the number of models each company is creating. The common approach is to throw as many features as you can into a model. Features that don’t improve the model can easily end up hurting it by increasing model complexity, reducing accuracy and making the model hard for users to understand. However, since it takes a lot of manual effort to find the noisy features and to remove them, most teams either don’t do it or do it sparingly. We have developed an AI assisted way to identify which features improve the accuracy of a model and by how much. In addition, we present a sorted list of features with an estimate of what accuracy (e.g., r2) improvement is expected by their inclusion.

There are some existing methods to handle the automated feature selection, almost all of which are computationally expensive and not translatable to big data applications. In this work, we introduce a fast feature selection algorithm that automatically drops the less relevant input features, while preserving and in some cases enhancing the model accuracy. The method starts by automated feature relevance ranking based on bootstrapped model training. This ranking determines the order of feature elimination which is much more efficient than randomized feature elimination. There are other simplifying assumptions during this feature selection, as well as our distributed implementation of the process that enable fast parallelized feature selection on medical big data.

Try Databricks
« back
About Alvin Henrick

Clarify Health Solutions

Alvin Henrick is a Staff Data Scientist at Clarify Health Solutions, with expertise in Parallel Database Systems, Interactive Querying, Distributed Query Execution, Query Scheduling, Machine Learning with Spark, and Deep Learning with TensorFlow and Keras. He is an Open source committer to Apache Tajo and Sbt-lighter plugin. Alvin holds a Masters degree in Computer Science from NIELIT, New Delhi. He has prior experience at VMware, Pivotal and Humana companies. Please visit for more info.

About Imran Qureshi

Clarify Health Solutions

As a Chief Data Science Officer, Imran oversees the Data Acquisition, Data Engineering, and Data Science teams (over 25 scientists and engineers). With nearly 12 years of healthcare technology experience, Imran is a med-tech data science leader. Before joining Clarify Health, he was the Chief Software Development Officer at Health Catalyst. As the CSDO, Imran was responsible for the software development in the company, including leading the engineering team building the Data Operating System (DOS). Some of his daily responsibilities at Clarify include working with customers, helping his teammates excel, thinking about complex problems, and learning continuously.

About Iman Haji

Clarify Health Solutions

Dr. Iman Haji is a Senior Data Scientist at Clarify Health Solutions. He received his PhD in Biomedical Engineering in 2016 from McGill University with the focus of his research on the analysis of medical big data for personalized diagnostics / rehab for neurological disorders. Shortly after his PhD, he co-founded and served as the CEO of the medical diagnostics startup 'Saccade Analytics' to apply his expertise in the field of accurate diagnosis of concussion and dizziness through virtual reality. He is currently a senior data scientist in Clarify Health Solutions analyzing healthcare big data.