AI-Assisted Feature Selection for Big Data Modeling

The number of features going into models is growing at an exponential rate thanks to the power of Spark. So is the number of models each company is creating. The common approach is to throw as many features as you can into a model. Features that don’t improve the model can easily end up hurting it by increasing model complexity, reducing accuracy and making the model hard for users to understand. However, since it takes a lot of manual effort to find the noisy features and remove them, most teams either don’t do it or do it sparingly. We have developed an AI-assisted way to identify which features improve the accuracy of a model and by how much. In addition, we present a sorted list of features with an estimate of the accuracy improvement (e.g., in R²) expected from their inclusion.

Some existing methods handle automated feature selection, but almost all of them are computationally expensive and do not translate to big data applications. In this work, we introduce a fast feature selection algorithm that automatically drops the less relevant input features while preserving, and in some cases enhancing, the model accuracy. The method starts with automated feature relevance ranking based on bootstrapped model training. This ranking determines the order of feature elimination, which is much more efficient than randomized feature elimination. Other simplifying assumptions during feature selection, together with our distributed implementation of the process, enable fast parallelized feature selection on medical big data.

Video Transcript

– Hi, folks. I’m Imran, and together with my teammates Alvin and Iman, I’ll be talking to you about how to do AI-assisted feature selection in Spark.

SUMMIT 2020 AI-Assisted Feature Selection for Big Data Modeling

And be sure to give us feedback. We would love to hear from you.

So I’m gonna walk you guys through the introduction. Why should you do feature selection? What are the current approaches and their limitations, and how is our approach different? Then I’m gonna hand it over to Alvin, who’s gonna go into more detail about our approach. And lastly we’ll talk about our open source library that you can use to do this yourselves in Spark.

First, something quick about Clarify Health. We process about 11 billion healthcare claims for 150 million patients. It’s a little over 40 terabytes of data. We train hundreds and hundreds of models. Right now we’re at 2,000 models, and they contain about 2,000 features. Everything is built on AWS in Spark, and within five hours we can train models for 84 million patients. The AI-assisted feature selection is part of our AutoML pipeline. Or, as I like to call it: how can new models be created while I’m sleeping? We start out by acquiring the data, certainly. The second step is our AI-assisted feature selection, which is what we’ll talk about today: we’re able to automatically select which features should be included in the model. After the features are selected, we go through a selection of the algorithm to use for creating the model, and after that we do automated Bayesian hyperparameter tuning to find the appropriate parameters for that model. And then next we do the training of the model.

And then after we have trained the model, we have a set of automated tests that check various things in the model and figure out if the model is good or bad. If the model is bad, it gets put into the rejected queue, where a data scientist will go in, take a look and figure out what went wrong. If the automated tests pass, the model is promoted to model tracking and then into the model explanation stage, where we generate the explanation for the model and also generate its metadata to enable exploration. And then these models become available in our interactive UI, which customers can use to see the predictions and play with the models.

And our main goal here is no human involvement required, unless a model is rejected and a data scientist needs to go look at it.

Okay, so let’s talk about whether more features are better or not. Certainly, there are a lot of people who still have models with fewer than 20 features. If that’s you, then this is probably not the talk for you. But if you have more than 20, hundreds of features, then fewer features may be better. So let’s talk about some of the problems with more features. First, more features make the models hard to understand. If you’ve ever looked at coefficients or feature importances for 2,000 features, you’ll understand that. It also increases the training time: the more features you have, the longer it will take Spark or anything else to train the model. And it introduces noise into the models, so as you add more features, the model is not able to find the optimal parameters. You also increase the chance of having highly correlated features, which can confuse the model in turn, resulting in a suboptimally tuned model. And lastly, more features can result in overfitting, so your model is not able to generalize well. So I would say less is better, if you can figure out what the right set of features is.

Okay, so what are the common current approaches? The first approach people take is what I would call the domain knowledge approach. Someone will say, well, I know that people with diabetes don’t stay longer in the hospital, so we should not include diabetes as a feature. What we found in the past is that a lot of times our domain knowledge is not that good. Once we look at the data, we find that a lot of the domain knowledge we hold is not actually correct. The second thing people try is removing features one by one from the model. You remove a feature, see how the model performs and see if that feature was really useful in that model. That works if you have a few features, but when you have hundreds or thousands of features, the number of possible combinations is extremely large, and there’s just no way you can try them all. The last approach people tend to try is to just use all the features. I mean, we spent time building these features. Let’s just throw them all in and let the machine learning models figure it out. This also results in suboptimal models, because model training is not able to figure out everything. For example, the choice between one-hot encoding and numerical encoding is really something a human figures out by asking: do the numbers mean something, or are they just unique values? This is what I would call the everything-but-the-kitchen-sink approach. So the ideal approach we’ve seen is combining the human and the machine together. This is what we call AI-assisted feature selection.

When we thought about this, we said, okay, what is it that this would need to do? The first requirement was that it had to be automated and had to intelligently select features. It couldn’t be a random selection of features. Number two was that it also had to scale. It had to handle thousands of features, like we have, and at least 50 million rows and 40 terabytes of data.

And the last one is, we also wanted the ability to preserve certain features that we always want. Customers might wanna see age, even though it may not be the best feature for the model. So these were the main requirements. I’m now going to turn it over to Alvin, who’s going to talk about how he and the rest of the team built our approach to address these.

– This is Alvin Henrick, and thank you, Imran, for giving such a nice overview. I’ll talk about why we chose to use the structure learning approach. Our goal was basically to come up with an approach that meets the requirements Imran already described, right?

Our approach:

That is, to basically get rid of the thousands of features we have and focus on the features we really need to focus on. We chose an approach called the structure learning approach, which is a filter-based approach, right? There are wrapper methods out there, but we needed something tailored for very high sparsity, with scalable and efficient (even if sub-optimal) feature selection and an option to specify must-include features. So we were looking for a solution that fits the needs you see on the screen, including must_include_features. We just don’t want to get rid of all the features, right? So the steps we have are: first, a quick ranking of all the features by importance, with no model training needed. Then we do a linear search based on the ranked features, training a model with N cross-validation steps to get the best model out of those features.

Our approach: the ranking algorithm (I/I)

So here’s what our process looks like. At the bottom of the screen you see the feature set, the whole feature set, one to n. On the right-hand side you have the outcome label, and on the left-hand side, in the yellow box, the ranked features. The way we calculate the ranking is that we compute the similarity between each feature and the outcome label and find the feature with the highest correlation. Once we find that most highly correlated feature, we put it in the ranked bucket with a score in front of it.

So you see that feature k is now in the ranked feature bucket, right? Because that is the feature most highly correlated with the outcome label, based on the similarity we calculated.

So this is how we move to the next feature. Now we pick feature_2, which you see at the bottom of your screen, right? That feature should have the highest correlation with the outcome label, but it should have minimal correlation with the features that are already in the ranked feature bucket, right? So we penalize using the formula you see on the screen. It’s basically the similarity of feature i with the outcome label, minus the max of the similarity of feature i with the features already in the ranked bucket. We use the max or the average so that the scoring is independent of the number of features that are already selected, right? So all we are doing is taking the correlation with the outcome label and de-prioritizing a feature when it has a high correlation with a feature already in the ranked bucket.
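Written out, the score described here can be sketched as follows. The notation is an assumption, since the slide formula is not reproduced in the transcript: f_i is a candidate feature, y is the outcome label, R is the set of features already in the ranked bucket, and sim is the similarity measure (the absolute correlation in this talk).

```latex
\mathrm{score}(f_i) \;=\; \mathrm{sim}(f_i, y) \;-\; \max_{f_j \in R} \mathrm{sim}(f_i, f_j),
\qquad \mathrm{sim}(a, b) = \lvert \mathrm{corr}(a, b) \rvert
```

Because the penalty is a max (or an average) rather than a sum, it stays between 0 and 1 no matter how many features are already ranked.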

Right, so feature_2 gets there and it gets the rank number two as you can see.

So, here’s the algorithm. I’ll describe the steps quickly, right? We rank the features based on their correlation with the output, select the feature with the highest correlation score as the top rank, and remove it from the bag of unranked features. We then rank the remaining features based on this formula: the feature score equals the absolute correlation of the feature with the outcome, minus the max of its correlation with the already ranked features in the bucket. We select the feature with the highest feature score as the next rank and remove it from the bag of features, right? We repeat this process until all the features in the unranked bucket are ranked. If there are must-include features in the list, we put them at the top of the ranked features list, because we want those features to be in the model.

Here’s the benefit of our ranking algorithm. We rank features based on their correlation with the outcome. If a feature significantly correlates with a previously ranked feature, it gets de-prioritized, as I said earlier, even if it has a high correlation with the outcome, right? The ranking is supervised, but without the computational burden of training.

So here’s an example. We have feature_3, which is highly correlated with feature_1. The selection of one of them will suppress the selection of the other, as the cross-correlation is nearly a hundred percent, right? Both will give us the same answer, the same benefit for the model.
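The ranking steps described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function names, the toy data and the column names are all hypothetical. Seeding the ranked list with the must-include features, so that later features are also penalized for correlating with them, is an assumption about how the must-include step interacts with the scoring.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(columns, label, must_include=()):
    """Greedy ranking: relevance to the label minus redundancy with ranked features.

    columns: dict mapping feature name -> list of numeric values
    label: list of numeric outcome values
    must_include: names placed at the top of the ranking (assumption: they also
                  count as already ranked when penalizing the other features)
    Returns (name, score) pairs, best first; must-include entries have score None.
    """
    relevance = {name: abs(pearson(vals, label)) for name, vals in columns.items()}
    ranked = list(must_include)
    scores = {name: None for name in ranked}
    unranked = [name for name in columns if name not in scores]

    while unranked:
        best_name, best_score = None, None
        for name in unranked:
            # Penalty: strongest absolute correlation with any already ranked feature.
            penalty = max(
                (abs(pearson(columns[name], columns[r])) for r in ranked),
                default=0.0,
            )
            score = relevance[name] - penalty
            if best_score is None or score > best_score:
                best_name, best_score = name, score
        ranked.append(best_name)
        scores[best_name] = best_score
        unranked.remove(best_name)
    return [(name, scores[name]) for name in ranked]


# Made-up toy data; the column names are hypothetical.
columns = {
    "age": [30, 40, 50, 60, 70, 80],
    "birth_year": [1990, 1980, 1970, 1960, 1950, 1940],  # perfectly correlated with age
    "prior_admission": [1, 0, 1, 0, 1, 0],
}
length_of_stay = [3, 0, 5, 2, 7, 4]

for name, score in rank_features(columns, length_of_stay):
    print(name, round(score, 3))
```

In this toy data, prior_admission is ranked first, and whichever of age and birth_year is ranked second pushes the other to the bottom with a negative score, which is the suppression effect from the example above.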

Our approach: examples

Now take example two, right, where feature_3 is correlated with both feature_1 and feature_2, right? If feature_2 is correlated as well, the second one can get de-prioritized, or the third one gets prioritized, because of the high correlation, right?

And then here’s an example on real data. We chose the Boston data set, where we had all the features, right? In the first experiment, to demonstrate, we ran it with no constraint, which means we don’t care which features are included; we let the algorithm figure it out. So here’s the ranking of the features: you get LSTAT, PTRATIO, RM, CHAS, right? These are the first few features. In the next experiment, we chose TAX and INDUS as features to be included in the list, so we rank them very high because we want them to be part of the model, right? And then you see the list at the bottom changes, but it doesn’t change much, because LSTAT, PTRATIO and RM were at the top of the first list. A similar thing goes for the second experiment, where RM, LSTAT and CHAS are still near the top of the list; the other two features get included as well, and the others follow, right? And this was just an experiment run on a sample to demonstrate, with the number of trees set to 20 and max depth set to five.

So here’s a simple graph to demonstrate. If you look at the bottom left of your screen, this is the run with the must-include features, where we started with an R squared of just about 0.4, on the left-hand side. And these are the numbers of features we included. By default, we start with the first two features included. And then at the top you see that as soon as you get four features into the model, there’s an elbow, and that’s what helps us detect when we can stop selecting more features: we have reached the peak, the best R squared we will get for this model training. We don’t need to look into the other features. But if you look at the run where we don’t give any opinion, it starts with an R squared of 0.6 and eventually goes to 0.8, or somewhere around that, right? Not more than that. So the algorithm is trying to do the feature selection in an optimal way, right? And luckily it has the capability to include those features which we want.
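The talk does not say how the elbow is detected, so the following is only one simple possibility, sketched under that assumption: take the smallest number of features whose score is within a small tolerance of the best score seen. The function name, the tolerance value and the R squared numbers are all made up for illustration.

```python
def find_elbow(feature_counts, scores, tolerance=0.01):
    """Return the smallest feature count whose score is within `tolerance` of the best.

    Assumes feature_counts is sorted in increasing order and aligned with scores.
    """
    if not scores or len(feature_counts) != len(scores):
        raise ValueError("feature_counts and scores must be non-empty and equal in length")
    best = max(scores)
    for count, score in zip(feature_counts, scores):
        if score >= best - tolerance:
            return count


# Made-up R squared values, loosely shaped like a curve that flattens after four features.
feature_counts = [1, 2, 3, 4, 5, 6, 7, 8]
r_squared = [0.60, 0.68, 0.74, 0.795, 0.797, 0.80, 0.80, 0.798]
print(find_elbow(feature_counts, r_squared))  # prints 4
```

A larger tolerance trades a little accuracy for fewer features; a tolerance of zero simply returns the first count that reaches the best score.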

Our approach: Incremental training

So here’s the approach for incremental training. First, we sample the training data to our desired size. After extracting the feature ranks, we start model training with one to N features, in steps of s features per iteration, with cross-validation for each model. At each step, the model accuracy is estimated. If s equals one, it means that the features are gradually added one by one. A larger s means that multiple features are added at each step, right? So once we have ranked those features, we train the model with the ranked features, and the algorithm lets you choose how many features to include at a time, to get the best model out of those ranked features. The end result is the plot we already saw, right? And using elbow detection, we find the optimal number of features to select for this model. As we saw for the Boston data set, we need to select just four features at most to get the same accuracy.
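The stepping scheme described above can be sketched as a small loop over nested feature subsets. This is an illustration under stated assumptions, not the library's code: `incremental_selection`, its parameters and the toy evaluator are hypothetical, and the `evaluate` callback stands in for training and cross-validating a real model on each subset.

```python
def incremental_selection(ranked_features, evaluate, step=1, start=1):
    """Score nested subsets: the top `start` features, then `start + step`, and so on.

    ranked_features: feature names, best first
    evaluate: caller-supplied function taking a list of feature names and returning
              a score (for example, a cross-validated R squared)
    Returns a list of (number_of_features, score) pairs.
    """
    if step < 1 or start < 1:
        raise ValueError("step and start must be at least 1")
    if not ranked_features:
        return []
    total = len(ranked_features)
    count = min(start, total)
    results = []
    while True:
        results.append((count, evaluate(ranked_features[:count])))
        if count >= total:
            break
        # The final step is clipped so the full feature set is always evaluated.
        count = min(count + step, total)
    return results


# Stand-in evaluator with made-up per-feature gains; a real run would train and
# cross-validate a model on each subset instead.
gains = {"f1": 0.5, "f2": 0.2, "f3": 0.05, "f4": 0.0, "f5": 0.0}


def toy_score(subset):
    return sum(gains[name] for name in subset)


ranked = ["f1", "f2", "f3", "f4", "f5"]
for count, score in incremental_selection(ranked, toy_score, step=2):
    print(count, round(score, 2))
```

With `step=1` every ranked feature gets its own model, which gives the finest curve for elbow detection; a larger step trains fewer models at the cost of a coarser curve.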

Our approach: Benefits

So the benefits of our approach: it is fast and scalable. Quick feature ranking, no model training needed; we just need a correlation matrix. A linear number of model training steps, each with cross-validation. You can manually select specific features to be included if the modeling application requires them, which is a very important thing for us, and the algorithm selects other features to complement them. Scalable implementation with Spark. Application on big sparse data sets successfully reduces our features by about 30% while maintaining the accuracy. So we can reduce our feature set by 33% while maintaining the same accuracy as we would have gotten with all those thousands of features in the model.

Here’s our open source library. We’ll be open sourcing our library, thanks to Imran, who really worked very hard on the tests to make this happen. So we decided to open source this library, and it is available in both Python and Scala.

So here are the two methods you will see. It’s a simple API. First is the feature ranker: you pass in your data frame, your feature columns, the output column, and the list of must-include features you want to include.

Simple API

And then we have the feature selector, which figures out from the ranked features which are the best features we could use to finally train the model. Here you will see the feature inclusion increment, which is a very important parameter. It allows you to do a jump: either you try each and every ranked feature, one at a time, and train the model to extract the best R squared or best fit, or you try the ranked features in increments of, say, three. So the first three will be included, then the next three, then the next three, and it will figure it out. And then you can obviously specify the train-test split ratio, the cross-validation, and the evaluation metric, whether you want to evaluate against R squared or RMSE or ME.

Sample Usage

Here’s a simple test using the Boston data set. The first thing is to do the feature ranking. Once you have the feature ranks, you call the feature selector, which gives you the scores back: which features are best to include and the best score for each feature. It will print them on the screen once you go through the example.

Our library: Further enhancement

So our library is hosted under Clarify Health as the feature selector. Our future goal is to add a few more ranking algorithms to the library. Right now we just sort by the scores, but there are other ways to do it. Also, as you saw, these are simple functions: you just pass in the data frame and parameters. Our goal is to convert this into a pipeline transformer, so it can be incorporated in our AutoML pipeline. It would go through the transformation phase the way we’d use the pipeline.transform API, to seamlessly integrate into our system.

So don’t forget to rate the session. Thank you for listening to us. Thank you Imran.

About Alvin Henrick

Clarify Health Solutions

Alvin Henrick is a Staff Data Scientist at Clarify Health Solutions, with expertise in Parallel Database Systems, Interactive Querying, Distributed Query Execution, Query Scheduling, Machine Learning with Spark, and Deep Learning with TensorFlow and Keras. He is an open source committer to Apache Tajo and the Sbt-lighter plugin. Alvin holds a Master’s degree in Computer Science from NIELIT, New Delhi. He has prior experience at VMware, Pivotal and Humana.

About Imran Qureshi

Clarify Health Solutions

As a Chief Data Science Officer, Imran oversees the Data Acquisition, Data Engineering, and Data Science teams (over 25 scientists and engineers). With nearly 12 years of healthcare technology experience, Imran is a med-tech data science leader. Before joining Clarify Health, he was the Chief Software Development Officer at Health Catalyst. As the CSDO, Imran was responsible for the software development in the company, including leading the engineering team building the Data Operating System (DOS). Some of his daily responsibilities at Clarify include working with customers, helping his teammates excel, thinking about complex problems, and learning continuously.

About Iman Haji

Clarify Health Solutions

Dr. Iman Haji is a Senior Data Scientist at Clarify Health Solutions, where he analyzes healthcare big data. He received his PhD in Biomedical Engineering in 2016 from McGill University, with the focus of his research on the analysis of medical big data for personalized diagnostics / rehab for neurological disorders. Shortly after his PhD, he co-founded and served as the CEO of the medical diagnostics startup 'Saccade Analytics' to apply his expertise in the field of accurate diagnosis of concussion and dizziness through virtual reality.