AutoML Toolkit – Deep Dive

Tired of doing the same ole feature engineering tasks or tuning your models over and over? Come watch how Databricks Labs is solving this. We will explore how this toolkit automates and accelerates: Feature Engineering/Culling, Feature Importances Selection, Model Selection & Tuning, Model Serving/Deployment, Model Documentation (MLflow), and Inference & Scoring.

Video Transcript

– Hi, thanks for joining us today. My name is Daniel Tomes and with me is Ben Wilson and Jas Bali. We’re the core developers for the Databricks Labs AutoML Toolkit. I’ll be starting us off today with a high-level introduction to the AutoML Toolkit. Then Ben will go into a deeper dive on some of the great features that we have on the feature engineering piece. Then he’ll go into the core tuning algorithm a little bit and talk to you about the roadmap and what’s to come in the future. And then Jas will clean this up and finish us off with an end-to-end demo showing us how it works from start to finish and show you what AutoML can really do in production. So without further ado, we’ll just go ahead and get into it. (clears throat) Okay, so to get started: I think Ben had built this a few times before, and he and I started working together a little bit after he got started on this idea. And we had a few goals in mind. The goals were to simplify feature engineering and all that trivial boilerplate code that you copy and paste from online every time you start a model. We also wanted to be able to get to something real and somewhat usable, faster, so that you could kinda get going and see if you even had an idea that was worth pursuing. After that, I think Ben’s brain went off the rails and he started creating all these new crazy ideas and adding them into the codebase. They were crazy ideas to me, but I guess they were things that he kinda had built several times in the past. And as some of those features started to come into the codebase, and I started to use them at customers as his number one champion (chuckles), I started to actually get pretty good results.
And we knew we had something when we started to outperform data scientists at our customers with this tool set. This was over a year ago, and ever since we’ve had opportunities to explore this at different customers and it’s only been improving. Then, about a year ago, Jas came online and said, hey, I wanna improve this and help you guys develop. We said, that’s fantastic, come on. And so Jas was introduced to the team and started to work with us and develop the pipeline, which I’ll talk to you more about later. This is great because it helps us to track and reproduce results very easily. Okay. So AutoML Toolkit has a lot of different features, and I’m not gonna run through all of these today, because you can read about them and they’re all documented really well online.

AutoML Toolkit – Includes

You can find this and all the information online. There you’ll see all the labs that have been published by Databricks publicly, and you can grab the AutoML Toolkit, whether it’s the jar, directions for how to get it on Maven, or the documentation. The documentation is very extensive, so I’ll spare you this today. And as I said, Ben is gonna be talking a lot about some of those features as well.

What would AutoML be without some model selection and model tuning? So obviously, that’s included, and Ben’s gonna talk us through some of that in a few moments. But basically, this is the progression of model selection, tuning and some of the extra features that are in there.

This model selection process also allows you to employ really great features like auto stopping to save you some money and time when you’re training your models. It also does another really nice thing, which is to explore as much of the hyperspace as possible. So when you pass in your hyperspace parameters, it’s going to maximize your coverage of that exploration to make sure that we find a real global minimum as best we can. Then it also includes an inference pipeline to simplify scoring. And one of the things that we wanted to do as we progressed this is to make it very simple to hand off use cases between teams and to DevOps when you’re ready to productionize.

(clears throat) It’s also not realistic to try and do that all without any tracking. The number of versions between features and models is extensive.

And it’s also really impossible to track all the different experiments across all the teams without some kind of tool, and that’s why we integrated directly with MLflow, using the model tracking toolkit that we have through MLflow.

Supported Models

We support several models and they’re really the base ones that are supported in Spark. We might be missing one or two here, but for the most part, these are the most common in Spark ML, and I’m gonna show you some ways to even use models that are not included in Spark ML and how AutoML can help you with that as well.

Transparent Configuration

In my opinion, one of the greatest features of AutoML, even though it’s probably not Ben’s favorite, is that you can override the configs and make it your own very quickly. So the idea here is that, at the team level, the organization level or the use case level, you build up a strong baseline default of all the configs that you need. From there, you just use them and override them when necessary. This makes it very easy to reproduce results and track them through the pipeline and the entire process.
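The baseline-plus-overrides idea described here can be sketched in a few lines of Python. This is only an illustration of the pattern: the config keys and the `build_config` helper are hypothetical and are not the toolkit's actual configuration names or API.

```python
from copy import deepcopy

# Hypothetical team-level baseline; these keys are illustrative only,
# not the real AutoML Toolkit config names.
DEFAULTS = {
    "fill_na": True,
    "variance_filter": True,
    "feature_interactions": False,
    "split_method": "random",
    "auto_stop_score": None,
}

def build_config(defaults, overrides):
    """Return a new config: a copy of the defaults with the overrides applied on top."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Fail early on typos so a run never silently uses a default by mistake.
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    config = deepcopy(defaults)
    config.update(overrides)
    return config

# Override only what this use case needs; everything else stays at the baseline.
config = build_config(
    DEFAULTS,
    {"feature_interactions": True, "split_method": "stratified"},
)
```

Because the baseline is copied rather than mutated, the same defaults can be reused for every run, and the small overrides map is the only thing that has to be recorded to reproduce a result.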

(clears throat) And yes, there’s Python. We do have a Python wrapper thanks to one of our colleagues, Mary Grace Molesta. You can see an example of the Python here on the right; this is implemented end to end at a customer, and you can see it’s only a few lines of code. Most of the lines of code are the imports and the data acquisition from the source. The AutoML stuff is very, very small, and that was the goal. As you can see here, we just overrode a few of the defaults for this customer. Because we were doing a test in this example, this is what we needed to override to run a few tests and see if we could improve our results. And then this is what it looks like when we are ready to hand it off to DevOps. You just submit a ticket to the DevOps team with the MLflow run ID, and they can take it from there. This is also how you can share your results and your models with other people as well.

Decoupling Workflows

What I’m gonna do for the remainder of my time today is demonstrate how you can decouple your workflows to save time and money.

There’s a lot of work that goes on before model training. In fact, as has been shown on many of the slides I’m sure you have seen this week, most of the work actually happens before the model starts. That’s where a lot of the money goes: into the feature engineering pipelines, the feature importances, feature selection and a lot of the testing. So what we don’t wanna do is rebuild our features every time we wanna run a model, and we also don’t need to select new features every time we run a model. So the pipeline in AutoML Toolkit allows us to separate those things, and I’ll show you that in my demonstration.

Demo Dataset

Today for the demonstrations, we’re going to be using the Wine Dataset. The Wine Dataset includes 178 instances of wine across, I think, 13 different attributes that help people determine whether the wine is going to be good, bad or ugly. Those are represented in a class of 0, 1 or 2. This makes it a multi-class problem, and we’re going to explore that today in AutoML.

As I’ve been doing some of these demos, throughout the challenges we’ve had in the world the last few months, I began to notice that sometimes it’s really easy with editing and other things to forget to really explain what you’re doing. So I thought I’d take a moment and actually explain what I’m going to do so that it makes more sense and it’s easy to follow. First, I’m going to load some data, then I’m gonna prep that data for the models, I’m going to then improve it with some synthetic data that’s gonna be generated through some of our features in AutoML Toolkit, such as Feature Interactions and SMOTE. Then I’m going to identify the most valuable pieces of data that’s going to help us and so we can reduce our noise and ultimately improve our result.

We’re gonna then train a bunch of models, and then we’re gonna get some awesome results.

With that, let me go ahead and kick off the demo. Hey, thanks for joining my demo. Welcome to this, it’s kinda weird during the recording, but bear with me and together I think we’ll get through it. So just to get started here, I wanna show you a little bit about the setup. Really all I’m gonna show you right now, because of the time constraints that we have, is these imports here. Jas is gonna talk more about this stuff in a little bit. So to get started, I’m just gonna run this and set up a few configs that I’m not gonna go into right now because it’s more than we have time for, and now we’re going to get some data. This is the Wine Dataset, and what we’re trying to do is predict the class. So this class will be predicted based on these attributes of wine as being good, bad or ugly, I guess you could say. The next part of the process is to start to scrub your features and prepare them for the model that you’re going to use. And granted, different models require completely different types of feature engineering. But in this case, we have a couple of trees and one logistic regressor that we’re going to use as a classifier, as you can see on your screen here. As we mentioned earlier in the presentation, there are a lot of feature engineering features that are included inside of AutoML. I encourage you to look at those in your own time, and there’s tons of documentation as we talked about before, but in this case, we’re just going to use three different types of feature engineering and one split feature. So the first is just filling NAs, as we always have to do; we can’t send nulls into the model, right, so there’s that. And then we want to make sure that any columns that don’t have any variance, we remove those, which obviously, this dataset doesn’t have.
And then we’re gonna turn on something fun here called feature interactions, which is going to enable us to generate some synthetic data if there are any features that are interacting. Ben’s gonna talk a lot more about that in a little bit, so I don’t wanna steal his thunder, and he’s gonna do a lot better job explaining it than I will. So there’s that, and then the last thing is we’re gonna split it. There might be a little class imbalance in here, so we wanna make sure we split it in a stratified way so that we keep our labels consistent. So we’re just gonna go ahead and kick this off. It should run pretty quick here, and if it doesn’t, we have the magic of editing, right. But all we’re gonna do here is just generate a config. We can talk more about the config later, but the config is just a set of overrides that we can apply on top of the defaults. And so I have a few preconfigured here, not many, and you can see some of them right here; it’s just a map that specifies some parameters for us. So in this case, I’m just going to take our defaults for XGBoost, as that’s what we’re going to use to identify our feature importances as per the splits on a trained model, okay. And then we’re going to just override some of the feature engineering pieces that we want to enable. Essentially, we’re just going to run through and do those pre-steps before you can run any new models, then we’re going to repartition and store it in cache. And if you look through here, you can see all the details about what happened, the input and output columns, but one thing that I wanna call your attention to is that now we have 92 columns. Well, we have a couple more than 92, but before we only had 12. So that’s the synthetic data that was generated based on the feature interactions that were identified.
So that’s a lot, generating 92 based on 12 features, so probably not all of them are necessary, but just assume that they were for the moment. In the event that they were and we wanted to reuse them several times later, we could actually save this data frame down. And when we save it down, we’re going to be able to pull it back in later and train models that are not even supported inside of AutoML. You’re going to be able to send it to things like GBM or elsewhere, to do other kinds of specialized training on it. But this allows you to do your feature engineering in a fast, pipeline-type way, okay. Okay, so back to the 94 columns that we have here. It’s basically the original columns that we had, plus a whole bunch of feature interactions that were identified, and then we have our feature vector. So essentially, it’s the same columns with the feature interactions and the feature vector, where any of the feature engineering that we did to the columns is already done in the feature vector. So we probably don’t want all 94 of those columns; it’s going to create a lot more noise than what we want. As such, we’re going to try and identify which of these feature interactions are the strongest and are providing the most new signal to our dataset, so that we can best predict what type of wine we’re going to have based on those attributes, okay. So the first thing we’re gonna do is kick off this feature importances run, and while it runs, I’ll talk to you about it. What we wanna do is minimize that noise: those features that don’t actually give us any value, we wanna take those out. So to do that, we’re going to train an XGBoost model, and from that trained XGBoost model we’re going to use its splits to identify which features are most important to deriving that predicted value.
So we’ll just watch this churn here, but a few notes: I have a few extra overrides here. What I’m doing is turning off that feature interaction flag ’cause we already have those features. And then I’m just keeping the stratified split, and I’m gonna go ahead and turn on auto stopping here. So once we get to 90% accuracy on this model with all these columns, I’m just gonna go ahead and stop, to make it go faster and save us some money, all right. So just like before, we generate our config from those overrides, and then we wait for this to train. And hopefully it won’t be long, because maybe it will find an answer early. But either way, you just kick off the feature importances run through this API endpoint here. I’ll be back with you once this is done. Okay, that didn’t take too long. So we trained through what looks to be about two generations of optimizations for this genetic algorithm; correct, there’s the second generation here. And through that, hopefully we identified some features that were valid for the prediction. So first, let’s take a look at them. We can do that just by looking at the importances, and here are the columns that were identified as relevant. Quite interestingly, we see that there are some feature interactions that are helping us to derive our value. And whether or not we hit 100% accuracy on this, I don’t know, but you can see the scores here. We could also check them out in MLflow, but we’ll demonstrate that a little bit later. Nonetheless, we definitely hit at least 98, and there’s our one, so we did actually get a run with 100% accuracy with all those columns. Now, of course, I didn’t do a whole lot of k-folds, and I didn’t do a holdout set and all that. So it’s probably gonna be a little overfit, but these are feature importances, they won’t change that much. So yeah, it’s a little overfit for sure, but we’re gonna take out a bunch of those columns.
So that’s what we’re gonna do next: we’re going to identify the columns that are valuable for us and then select only those columns. So now we’re back down to basically what we started with, 10 or 12 columns, something like that. However, now we have the columns that are identified as the most important, and we generated these columns here, these feature interaction columns. Before we kick off the main run for the training of all these models, I wanna note something that we’re going to do inside of these models. One of the overrides I have specified here is to enable KSampling, or SMOTE. That’s going to generate additional synthetic data: before, we generated synthetic data on the column layer; now we’re gonna generate synthetic data on the row layer through some fanciness. Now, I’ll let Ben cover this ’cause Ben is gonna cover KSampling and SMOTE as well, but I want you to see how easy it is. KSampling and SMOTE are complex ideas and hard to implement, but you’re not seeing anything about that. Literally all I have inside of these configs is the split method set to KSample. So now we’re gonna go ahead and kick this off, but I’m going to do a little time lapse here so you can see it train without having to sit here and listen to the crickets. I’d love to play some music for us, but I don’t know if I have the licensing or copyrights. So we’re just gonna sit here and I’ll speed it up so it’s not extremely, awkwardly boring. But before we get started, I thought I would share a few goodies. Here there’s a starting generation; as you know, this is a genetic tuning algorithm. So that’s the generation, and then below, you can see some of the configs with regard to the hyperparameters that are set there. There are a lot of other things going on, and this is where the standard output is for everything that’s going on during the training cycle.
And this is actually quite interesting and I sometimes just sit there and watch it for a little bit to make sure that my model is working okay. So this is a really valuable piece as it can help you understand what’s happening under the covers while your model’s training. So without any further ado, let’s go ahead and speed this thing up.

Okay, so back to real time. This is XGBoost training finishing up here, and you’ll notice at the end it’s going to spit out some slightly different data. Then it’s going to switch right over into random forest, because that’s been tied into our config there. As you can see here, the random forest config sits alongside the XGBoost config, and Jas is gonna talk more about the pipelining and stuff. But yeah, this is gonna help us a lot when it comes down to running one model and then another and then another and testing them all with their own hyperparameters, to ensure that we can turn it on at night, walk off, and come back tomorrow and have it done. So with that, we’re gonna go ahead and speed up time again and watch the rest of training.

Okay, now that the other models have finished training, let’s take a look at what actually happened. To do that, I’m just gonna pop over into here and look at my results, and I’m gonna go to my best results; Jas will show you more about MLflow in a minute. So here are my results, I can just take a peek at them and compare them. But for now, I’m just gonna note that several of them, I think the random forest and the logistic regression, both hit 100% accuracy on this, and pretty quickly, and I have about 16 cores on these nodes. So now we’re gonna head back over to our main notebook here. We’re going to kick off the manual version of what we just did, but at a very small scale, and we’re not even gonna do grid search or anything like that. We’re just gonna run two models: we’re gonna load the raw dataset and run XGBoost and random forest, but this typically winds up in some kind of loop with a grid search and everything else. I just ran this and it did finish really quickly because it was just two models on 178 rows (chuckles), but I just wanted to note the scores here. We got a random forest of like 0.15 weighted PR versus our 1.0, and an XGBoost of 0.15 versus AutoML’s first couple minutes out of the box running 98.8, or whatever it was. So you can see it’s clearly much faster than writing all this code and then trying to do a grid search. You just pop in your config, override a few here and there depending on what you want to do, and you’re off to the races. All right, and as you can see through the demo, I did exactly what I said I was going to do: we loaded the data, we prepped it for the models, then we improved the original dataset through feature interactions and SMOTE. We identified the most valuable data and reduced the noise. Then we trained a whole bunch of models and we did get some awesome results.
So thanks for that, and with that, I’m gonna go ahead and turn it over to Ben to do a deep dive on a few of the pieces of AutoML.

– I’m Ben Wilson, Practice Lead on the RSA team working with Dan Tomes. And as Dan mentioned, we built a pretty cool team here with the developers that are on this call.

We also have a couple other people that have contributed to this project. And it’s been a little bit more successful than we thought it was gonna be, which is kinda cool. A lot of customers are using it, but more importantly, people are able to scale their machine learning teams by automating some of the more annoying parts of modeling that people sometimes have to do. So what I’m gonna be talking about today: as Dan mentioned, I’m the resident nerd of this group, the algorithms dude, so I’m gonna be going through a couple of those things that Dan showed off, particularly what KSampling is and why it was built, as well as the feature interaction aspect of this. And we’ll do a quick overview, high level, maybe mid level, of what the actual tuning algorithm is and how it differs from some of the other ones that are on offer in the space.

And why it might be more successful for some of your use cases than other ones you might have tried in the past. We’ll also talk about performance, and how to melt copper on the cluster.

So the first thing we’re gonna be talking about is KSampling. But before we get into the details of it, we’re gonna talk about the different ways we can split a classification task up for training and test or validation throughout AutoML.

Data Splitting

So for classification tasks (we’ll talk about regression at the end), there are four main ways that we can split our train and test so that we can use AutoML to do hyperparameter tuning. Now there’s a fifth one that’s not listed here, which is random. Random is, as it implies, how most people typically do splits, where they’re saying, just give me some random values from my dataset, and I want 80% train and 20% test, or 70/30, whatever you so choose, and it’ll just randomly pick that, which is completely fine for most regression tasks. But for classification, you run the risk of selecting too few samples of a potentially minority class in either train or test. So what we implemented over a year ago now is a stratified split. That’s the first approach you should look at when you have class imbalance, and this works for either a binary or multi-class classification task. Stratification is gonna look at your data, it’s gonna do some approximate counts of how many of each class you have, and it’s gonna try to split that in accordance with those ratios. So if 90% of your data is class label zero and 10% is class label one, it’s going to maintain that ratio in your train and test. So you’ll still get that 90/10 split between those classes. This also works for more complex multi-class classification, where you might have a 70/20/10 split between your classes. It’ll look at that and it will intelligently sample the data to make sure that you have that same balance between train and test.
That’s all well and good until you look at some of the use cases that people have for binary classifiers in particular, for stuff like e-commerce and financial services, where you’re trying to do classification with a fraud model. Unless you have a really terrible business model where everybody’s trying to steal your stuff, you’re usually going to be looking at fraud rates that are fractions of 1%, and traditionally how people have handled that is the second item, which is under or over sampling. Now under sampling is taking the majority class, the not-fraud in this case, and reducing the row count of that. It’s considered a destructive task, because you’re removing signal from your actual model training stage. For some use cases, that might be fine, people have kinda adapted to that over time, but you’re losing signal. And the reason you’re doing machine learning on Spark is so you can use all the data that you have available to answer your problem and get a good model. So we’re not gonna really talk about that too much. It is available if you choose to use it, but there are some drawbacks to it. The other approach is people trying to boost the minority classes by copying or replicating the actual signal. So you take that one tenth of a percent that is the minority class and you just copy that data 50 times to try to give it enough row count so that the classifier can learn that information. That introduces a whole different problem, which is bias in your model. It’s gonna learn that one class with those particular characteristics really well, but it’s not gonna adapt to unseen data that well. So you may get okay results on a test set while you’re doing hyperparameter tuning, but then when you actually run it in the real world, or against holdout validation data, it might just completely fall apart. So you could get a really good accuracy or an F1 score, but as we all know, those are kind of garbage under class imbalance.
So in a binary classifier, you can look at area under ROC and you’re like, wow, I got a 99.1% area under ROC. It’s because it successfully classified the 99.1% of zeros, and it could not learn the minority class. So that’s also kinda bad. We also have a chronological split, which allows us to set a date or a datetime field, and you can specify a percentage of the distribution of that time series at which to make a split or a cut. It’s mostly used in regression problems; I’m not gonna be talking about it too much, but that’s also there. What we are here to talk about is what Dan already showed and what Jas is also gonna show in his demo, which is KSampling. That name means almost nothing to anybody in this audience, but what it is, is a distributed implementation of SMOTE.
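To make the stratified split concrete, here is a minimal single-machine sketch in plain Python. It is not the toolkit's implementation (which runs on Spark and uses approximate counts); the `stratified_split` function and its parameters are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, train_fraction=0.8, seed=42):
    """Split rows into train/test while keeping each class's share the same in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Cut every class at the same fraction, so the class ratios carry over.
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# A 90/10 imbalanced toy dataset: (row_id, class_label).
rows = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]
train, test = stratified_split(rows, label_of=lambda r: r[1])
```

With these inputs the minority class makes up 10% of both the train and the test set, whereas a purely random 80/20 split gives no such guarantee and could leave the test set with very few minority rows.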

So at the root of what SMOTE uses to determine how to actually build the synthetic data is KNN and KNN is a great algorithm.

It’s far more accurate than using some of the other algorithms that are out there for clustering like K means, because it’s able to do the associative distance measurements between elements within your data set. The drawback to KNN though, is that all your data has to be on the same physical machine.

Now people have tried to do and implement SMOTE and get it working on Apache Spark on a big distributed system. The problem is that it’s very not performant when you try to implement this algorithm, reason being is because you have to move all the data from all the executors to all the other executors. So you need this consistency of state on a single physical machine in order to do that calculation. So lots of shuffling, very expensive, takes forever on runtime. So because that’s not really an option for us, what we decided to do was think about what is the actual solution that we’re trying to achieve here, which is synthetically generate data that is similar to other elements of these imbalanced minority classes, but how can we do it in a distributed manner?

K Sampling in Detail

And I spent about six weeks breaking this 1000 different ways before I finally figured out a way to do this that actually works, shockingly. We’re gonna go through the steps right now of what this algorithm does at a relatively mid level. So first thing is, this happens after we’ve already generated our feature vector from all the processing steps that happen: we’re indexing strings, we’re creating one-hot encodings, whatever we’re doing in order to create that feature vector. This happens right before we go into modeling. It’s gonna take that feature vector, regardless of whether we scaled it or not, and it’s gonna apply MaxAbs scaling. This allows us to constrain the space that we’re gonna be searching and normalize the data, so that the distance measurements that we’re doing throughout KSampling are not influenced by varying degrees of magnitude of the doubles within those vectors. We scale it, then we build a K-Means model, which is a distributed algorithm. We’re just using the Apache Spark ML library to do this. The reason we’re doing that is because we want centroids, and we wanna be able to look at the data frame after we’ve already transformed it with that K-Means model. Then we’re gonna go through a quorum voting algorithm: we’re gonna say, for each of these K clusters that we have, and each of these centroids, determine which ones most accurately describe the minority classes that we wanna boost the signal of. We’re gonna go through iteratively until we vote on a candidate pool of K clusters, then we’re gonna get the centroids from those K clusters.
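The MaxAbs scaling step mentioned above can be illustrated with a small pure-Python sketch. It mirrors the general idea of max-abs scaling (divide each feature by its largest absolute value so everything lands in [-1, 1]); the function names are hypothetical and this is not the Spark code the toolkit runs.

```python
def max_abs_scale(vectors):
    """Divide every feature by the largest absolute value seen for that feature."""
    n_features = len(vectors[0])
    max_abs = [max(abs(v[j]) for v in vectors) for j in range(n_features)]
    return [
        # A feature that is all zeros stays zero instead of dividing by zero.
        [v[j] / max_abs[j] if max_abs[j] else 0.0 for j in range(n_features)]
        for v in vectors
    ]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Two features on very different scales.
raw = [[2.0, -100.0], [-4.0, 50.0]]
scaled = max_abs_scale(raw)
```

Before scaling, the distance between the two rows is dominated by the second feature simply because its values are larger; after scaling, both features contribute on the same footing, which is the property the distance measurements in the later steps rely on.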

And from those, sort of in series, we’re gonna be building a MinHash LSH model, which is gonna give us a rather efficient distributed means of getting distance measurements. So we’re gonna apply the centroid vectors to the MinHash LSH and attempt to find the nearest neighbors to those centroids. It will search in the N-dimensional space around each centroid and capture all the minority class rows that we’re interested in. We’re gonna collect those vectors and then recurse over them, those candidate minority class vectors that are real data, and we’re gonna start mutating them with the centroid value itself. There’s a number of different parameters that you can tune in AutoML to control how that behavior happens, whether you want to fully randomize it, or control in a linear manner just a couple of the vector positions to mutate within that dense vector. We do all that until we generate either a complete match to the majority class or we hit a percentage. Speaking as the person that developed this, I do recommend using a percentage. Don’t do a match on extreme imbalance, or you’re gonna need a bigger boat. It’s going to replicate your data up to that point, and you could double your data size or even more. So usually use a percentage, and get it so that it’s not such an extreme class imbalance. Once you’ve got that number of synthetic rows generated, we’re gonna flag all of them. The reason we need to flag them is because we cannot test on those, and we can’t use them for validation. That would give us an incorrect score. So what we wanna do is only apply them toward our training set, but mark them as synthetic, so that all the following stages, everything in the pipeline that Jas has built and integrated, ignores those rows of data.
It’s just for the algorithms to tune against to get more information.
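The following is a minimal single-machine sketch of the idea described above, not the toolkit's implementation. It skips the K-Means, quorum voting and MinHash LSH steps and assumes a centroid and a set of real minority vectors are already in hand. The function names, the `mix` blending factor and the `target_ratio` parameter are hypothetical names chosen for illustration.

```python
import random


def max_abs_scale(rows):
    """Divide each column by its largest absolute value so entries lie in [-1, 1]."""
    n_cols = len(rows[0])
    # Fall back to 1.0 for an all-zero column to avoid dividing by zero.
    scales = [max(abs(r[j]) for r in rows) or 1.0 for j in range(n_cols)]
    return [[r[j] / scales[j] for j in range(n_cols)] for r in rows]


def mutate_toward_centroid(vector, centroid, n_positions, mix, rng):
    """Blend n_positions randomly chosen positions of a real vector with the centroid."""
    out = list(vector)
    for j in rng.sample(range(len(vector)), n_positions):
        out[j] = (1.0 - mix) * vector[j] + mix * centroid[j]
    return out


def synthesize(minority_rows, centroid, majority_count, target_ratio,
               n_positions=1, mix=0.5, seed=42):
    """Grow the minority class to target_ratio * majority_count rows.

    Every row carries a 'synthetic' flag so later stages can keep
    generated rows out of validation and test splits.
    """
    rng = random.Random(seed)
    needed = max(0, int(majority_count * target_ratio) - len(minority_rows))
    rows = [{"features": list(r), "synthetic": False} for r in minority_rows]
    for i in range(needed):
        base = minority_rows[i % len(minority_rows)]
        rows.append({
            "features": mutate_toward_centroid(base, centroid, n_positions, mix, rng),
            "synthetic": True,
        })
    return rows


scaled = max_abs_scale([[2.0, -10.0], [4.0, 5.0]])
boosted = synthesize(scaled, centroid=[0.75, -0.25], majority_count=10, target_ratio=0.5)
# Only real rows are eligible for scoring.
validation_safe = [r for r in boosted if not r["synthetic"]]
```

Using a ratio rather than a full match mirrors the recommendation above: the number of generated rows stays bounded even under extreme imbalance.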

And this is a visualization of exactly why we built this.

Why did we build this?

This is a typical data set that I’ve seen at many customers who are trying to do fraud detection or trying to detect some sort of anomaly in their data, and they’re using a classifier approach for it. It’s really hard to find that signal, and this is kind of showing just how much of an imbalance that can be. So with KSampling, we can take those red bars, and we can bump them up to a certain percentage, so we can learn that signal a little bit better. The important thing is, in the validation stage, we’re not bumping that bar up, because that would be sort of unethical and we would get a synthetic score that’s not reflecting the actual data.

Another thing that Daniel talked about and showed off in his demo, something that Jas is gonna show off as well, is feature interactions. (clears throat) This is something that a customer actually asked me a question about, because they were doing this manually.

What is feature interaction?

And they actually convinced me through showing their results, that this is a really cool technique, particularly if you have a sort of a problem to solve in machine learning that you can’t use ensembles due to explainability concerns, or it’s just too much of a runtime impediment when you’re doing inference to use some sort of stacked ensemble method. So they showed me, hey all we do is just take a product of these things, each of the features that we think are gonna be important. We score them, validate them and then we manually put them into a data science ETL pipeline for feature engineering.

And I went off and did what I normally do, which is read a bunch of papers and look at a bunch of blog posts and ask a couple of people and say, hey, have you ever done this? Turns out, there’s a couple people that I asked that actually said, yeah, this works pretty well in some use cases. But it’s kinda hard to control what you’re actually introducing. The one thing you don’t wanna do is just blindly add everything, because you can introduce so much noise and variance to your signal that you can get some really bad results and destroy your model. So I said, okay, I’ll build it three different ways.

But the important thing is, this is not something that we built from scratch. This is implementing something that other people have thought of; we just did it in Spark.

How does it work?

So at a high level, this is what’s actually going on in feature interaction. We’re taking each feature from that data frame, for that feature vector that we’re about to create, and we’re gonna just multiply them together. And we’re gonna create a hybridized feature: we take feature A and feature B, multiply them together, and we get feature A_B. Then we’re gonna use basically what decision trees use for the validation steps. So for a classification problem, we’re gonna calculate entropy on a split criterion. We’re gonna say how important is this feature for explaining, basically, these classes that we’re trying to predict. So it’s kind of what a decision tree is doing in its first layer, the first round of splits. We’re not actually recursing down through the tree and doing repeated splits like we do with decision trees, we’re just doing that first layer. And we’re calculating entropy for classifiers. We’re calculating variance for regressors. And we go through and do some checks, which we’ll talk about in the next slide, about which ones to keep and which ones not to keep based on our settings.
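As a rough illustration of the scoring step for the classification case, the sketch below multiplies two features and measures the information gain of the best single threshold split, with no recursion below that first level. It is plain Python rather than Spark, and the helper names are hypothetical; the toolkit's actual scoring code may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_split_gain(values, labels):
    """Information gain of the best single threshold split (first layer only)."""
    parent = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = 0.0
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = max(best, parent - weighted)
    return best


def interact(a, b):
    """Element-wise product of two feature columns: the A_B candidate."""
    return [x * y for x, y in zip(a, b)]


# An XOR-like case: neither parent separates the classes, the product does.
a = [-1, -1, 1, 1]
b = [-1, 1, -1, 1]
labels = [1, 0, 0, 1]
a_b = interact(a, b)
print(best_split_gain(a, labels), best_split_gain(b, labels), best_split_gain(a_b, labels))
# prints: 0.0 0.0 1.0
```

For a regression target, the same structure would apply with variance reduction in place of entropy, as mentioned above.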

Process Steps

So as I mentioned, we have these different modes that are available for feature interaction. Now, the one that I recommended not to use is, of course, in here. It’s definitely not the default, but it’s the all methodology. And that’s for if you’re doing something like what Dan was showing, which is, I wanna do feature importances. I wanna just see what the results are for everything that I could possibly test. It’ll do the interactions of every single feature with every other feature, and then you can evaluate it statistically, you can do your own tests, you can see, oh, there’s some sort of signal here that I didn’t think was there. So it’s an information exploration tool.

For all other cases, we have optimistic set by default. So if you turn it on and you don’t override anything, optimistic is gonna go through, and it’s gonna check each candidate. So for the product feature, the feature A underscore B that’s created, it’s gonna get the actual information gain score from that split, and it’s gonna compare it to both parents. And if it’s as good as X% of the information gain of either one of its parents, either A or B, it’s included.

That’s optimistic mode. In strict mode it’s a little bit harder for interacted features to actually become a candidate. (clears throat) And what that’s going to do is, it’s still gonna do the same check. It’s gonna check A underscore B against parent A and parent B. But it has to be X percent as good as the information gain of both parents.

So it’s an AND, right, Jas?
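The difference between the three modes comes down to how the candidate's information gain is compared with its parents' gains. A small sketch of that decision, where the function name, the `pct` parameter and its default are hypothetical and only the OR versus AND logic is taken from the talk:

```python
def keep_interaction(gain_ab, gain_a, gain_b, mode="optimistic", pct=0.9):
    """Decide whether the interacted feature A_B is retained.

    pct is the fraction of a parent's information gain the candidate must reach.
    """
    if mode == "all":
        return True  # exploration: keep every interaction
    reaches_a = gain_ab >= pct * gain_a
    reaches_b = gain_ab >= pct * gain_b
    if mode == "optimistic":
        return reaches_a or reaches_b  # good enough against either parent
    if mode == "strict":
        return reaches_a and reaches_b  # good enough against both parents
    raise ValueError(f"unknown mode: {mode}")


# A candidate that reaches parent A's threshold but not parent B's.
print(keep_interaction(0.5, 0.4, 0.9, mode="optimistic"))  # True
print(keep_interaction(0.5, 0.4, 0.9, mode="strict"))      # False
```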

Alright, so the final thing that I’m gonna talk about is gonna be how this thing actually works from a geek level.

Stages Genetic Algorithms are fun

And there’s four main stages that occur within AutoML for its tuning algorithm. The first one is seeding a gene pool. What this is gonna do is explore through permutations of a search space that’s defined. And these are the things that you can actually override, or just leave the defaults. We have ranges that are set for each algorithm on each of the hyper parameters, and we also understand what the distribution type is of those hyper parameters. Some of them are normally distributed or linearly distributed, some of them are exponential, some are logarithmic. So we capture that search space that we’re gonna be looking at, and we generate a bunch of hyper parameters. And you can control how many to start with in that initial gene pool. Then we actually start running these, but they’re all run in parallel. So we create a new ForkJoinPool, we set a level of parallelism, and the driver will then asynchronously, through futures, kick off a bunch of modeling runs on the workers. These will return out of order, whenever they are done doing what they’re gonna do. As one returns, another one will start up, because the ForkJoinPool only executes concurrently as many as what you set as the parallelism. When all of the models for that first generation have been completed and have returned their results, we then go into the evolve stage. What that’s gonna do is take the top N models from that preceding layer, and it’s going to generate a bunch of new things to test. And then it’s gonna mutate the best conditions from the parents that were retained from the previous generation with these new synthetic hyper parameter collections. And it’s gonna go through the same process as the first one did: it’s gonna asynchronously kick off N number of these in parallel, and when all of that pool finishes and returns, we go to the next stage.
And you can set through configuration how many generations you’re gonna test in this manner. And there’s other settings where you can say, I want to reduce the amount of search that happens each generation, I wanna mutate less and less in order to try to converge on the best condition.
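The seed and evolve stages can be sketched in a few lines. This is a toy under stated assumptions, not the toolkit's code: a Python thread pool stands in for the ForkJoinPool on the driver, `score` is a made-up function standing in for training and scoring a model, and the search space, the uniform sampling and the `0.5 / generation` mutation schedule are all hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical search space: name -> (low, high)
SEARCH_SPACE = {"max_depth": (2.0, 20.0), "learning_rate": (0.01, 1.0)}


def score(params):
    """Stand-in for a modeling run; best near max_depth=8, learning_rate=0.1."""
    return -((params["max_depth"] - 8.0) / 18.0) ** 2 - (params["learning_rate"] - 0.1) ** 2


def random_candidate(rng):
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in SEARCH_SPACE.items()}


def mutate(parent, rng, strength):
    """Blend a retained parent with a freshly sampled candidate."""
    fresh = random_candidate(rng)
    return {k: (1.0 - strength) * parent[k] + strength * fresh[k] for k in SEARCH_SPACE}


def run_generation(candidates, parallelism):
    """Score candidates concurrently; results arrive in completion order."""
    results = []
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = {pool.submit(score, c): c for c in candidates}
        for done in as_completed(futures):
            results.append((done.result(), futures[done]))
    return results


def tune(generations=5, pool_size=20, top_n=4, parallelism=4, seed=7):
    rng = random.Random(seed)
    # Stage 1: seed the gene pool.
    history = run_generation([random_candidate(rng) for _ in range(pool_size)], parallelism)
    # Stage 2: evolve from the top N so far, mutating less each generation.
    for generation in range(1, generations):
        ranked = sorted(history, key=lambda r: r[0], reverse=True)
        parents = [params for _, params in ranked[:top_n]]
        strength = 0.5 / generation
        children = [mutate(rng.choice(parents), rng, strength) for _ in range(pool_size)]
        history += run_generation(children, parallelism)
    return history


history = tune()
best_score, best_params = max(history, key=lambda r: r[0])
```

The full `history` of tested combinations and scores is what a later stage can learn from, which is why the sketch returns everything rather than only the winner.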

And after we’re done going through all of those generations, we’re going to basically gather all of them, everything that has been tested as well as the metric that you’re scoring against, and it’s gonna create a relatively large data frame that we can then train a linear model against. And this linear model is going to be used to attempt to figure out what the best optimized parameters are across the entire search space. We do this by generating a truly astronomical number of permutations of hyper parameters, through a synthetic data set. So by default, I believe it’s 400,000 rows, or 400,000 combinations of hyper parameters, that are generated. You can take that all the way up to a couple million if you want. You might not wanna go over about 2 million, but it’ll generate a lot. And then it’ll apply the model that it trained on the a priori results that we got from the genetic algorithm to try to predict which hyper parameter combinations are gonna give the best result. And then we actually take N number of those predictions and build those models. That’s the final training stage, the final generation. And once that’s done, hopefully we get the best model that we can. Now finally, I’m gonna talk about a couple of the new upcoming features that we have that are sort of planned on our roadmap.
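A compact sketch of this post-genetic step, assuming the tuning history is a list of (hyper parameter vector, score) pairs. It fits an ordinary least squares model in plain Python, scores a large batch of randomly generated combinations, and returns the top N for a final round of training. The function names and the small default candidate count are illustrative choices, not the toolkit's API or defaults.

```python
import random


def fit_linear(X, y):
    """Ordinary least squares with an intercept, solved via the normal equations.

    Assumes the history is varied enough that the system is not singular.
    """
    rows = [[1.0] + list(x) for x in X]
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(weights, x):
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


def top_predicted(history, bounds, n_candidates=10_000, top_n=5, seed=0):
    """Generate random combinations and keep those the model predicts will score best."""
    rng = random.Random(seed)
    weights = fit_linear([x for x, _ in history], [s for _, s in history])
    candidates = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_candidates)]
    return sorted(candidates, key=lambda c: predict(weights, c), reverse=True)[:top_n]


# Toy history whose score is exactly 1 + 2*x1 - 3*x2.
history = [([0.0, 0.0], 1.0), ([1.0, 0.0], 3.0), ([0.0, 1.0], -2.0),
           ([1.0, 1.0], 0.0), ([2.0, 1.0], 2.0)]
finalists = top_predicted(history, bounds=[(0.0, 1.0), (0.0, 1.0)])
```

The finalists are only predictions; as described above, they still have to be trained and scored as real models.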

Roadmap features

One of them, and I alluded to it before, is Stacked Ensembles. We already have the POC for that, sort of the MVP. It’s not in the AutoML Toolkit yet, but we are working on it. And this is gonna allow us to put a bunch of weak learners of different model types into a first layer, and these are all gonna be tuned through AutoML. So we’re gonna get the best possible model that we can for the first stage of that. And then on top of that, we’re gonna take those pipelines, we’re gonna glue them together into one big master ensemble pipeline, and then throw an additional model, either a regressor or a classifier, on top of the predictions of those. And based on our testing, it beats everything that AutoML can do on its own as of right now. So it’s definitely something we’re excited about working on and pushing out to the public. The other thing we’re working on is an improved method for the slide that I just talked about, which is genetic algorithm 2.0.

And one of the things that we wanna do is implement something that is pseudo SGD-based. So when we’re actually searching for candidates in each subsequent generation, we wanna make it a little bit more intelligent. We wanna predict where on a fitted curve the next set of hyper parameters should be tested, and also search for spaces within that search space that haven’t been adequately tested yet.

So that we can explore the space more comprehensively before that final tuning phase where we say, give me all the data now and tell me which one’s probably gonna be the best combination. And the final thing, it’s currently partially in AutoML, but there’s no public-facing API for it, and we are slowly adding to it. It’s something that is gonna help out with some of the stuff that Dan was showing in his demo, which is: we generate these features, how do we analyze them correctly? And how can we get better visualization and statistical readout reports on things that data scientists care about? Stuff like, I wanna see my correlation relationships, I wanna check my correlation and covariance reports, I wanna do an ANOVA on my data for a particular column, based on my class labels. I wanna see, is there a strong signal here that I should be focusing on and building additional feature engineering steps for before it even gets to the toolkit? So we wanna sort of give out those tools to the ML community to do this stuff on Spark without having to revert to subsampling the data to such a small degree that you could actually lose some of the information that you’re looking for. And also stuff like automatic PCA analysis of the feature vectors to determine if this candidate feature engineering set that you’re trying to push through AutoML is even worthwhile for what you’re trying to predict, and to see if there’s an actual signal there. So on that note, I’m gonna turn it over to Jas for his pipeline demo of his awesome codebase.

– Oh, thanks, Ben, and thanks, Daniel, that was awesome. Hello everyone, my name is Jas Bali. I’m a senior solutions consultant with Databricks and also a core developer on the AutoML codebase. So for the next demo, I’m gonna be walking you through some of the stuff that Ben talked about, and also show you how simple it is to build self-contained prediction pipelines out of the AutoML codebase. Let’s get started.
All right, for the next 15 minutes, I’m going to talk about and walk you through AutoML’s FamilyRunner APIs, and we’re gonna see how easy it is to productionize your AutoML pipelines. Let’s start with getting a data set. In my case, I’m going to use the wine data sets published in the UCI Machine Learning Repository, and I’m particularly going to use the Wine Data Set. Also, who doesn’t like wine, right? All right, so let’s look at what the data looks like. So in our case, it looks like it contains some chemical analysis of wines, and there are 13 feature columns. And we are predicting which particular grape variety this wine belongs to. So that’s what our data set is. Next thing we are gonna do is import a few important classes from the AutoML library.

As you can see, there are three classes that I’m importing: ConfigurationGenerator, FamilyRunner and PipelineModelInference. ConfigurationGenerator is gonna be helpful when we have a map of configurations and we want to convert them into an immutable Scala object that is then fed into the FamilyRunner to run our modeling part. The next class is the FamilyRunner. It’s the main entry point into using any of the pipeline APIs. And then the third class is PipelineModelInference, which we are going to use to load pipeline models to do inference. And then the next part is initializing a few multinomial classification tasks here. So what we’re doing here is, as you can see on my screen,

I’m going to train and optimize four models using AutoML: logistic regression, random forest, decision trees and XGBoost.

In my case I’m defining two configurations, and in the first one, as you can see, there are some important flags. So the first one I’m setting is the KSampling one. You heard Ben talk about the KSampling functionality of the algorithm; this is how you use it. And then the label column is obviously needed.

And then I’m also turning on the feature interaction flag, and I’m saying which mode I want to use. You heard Ben talk about the modes, optimistic and strict. In my case, it’s all, so the feature vector is gonna be big, but it has been seen offline. So let’s see what it looks like.

Another configuration I’m setting is the scoring metric, which is f1. There are defaults for all of these configurations, and there are a bunch of configurations around AutoML that you can tap into. I recommend going to the API docs on AutoML’s public repo, and if you have any questions on what each configuration is, you can go there and find it out and then tune it as much as you want. The next configuration is a similar one, but instead of the f1 scoring metric I’m setting integrated result, so I’m going to use this particularly for XGBoost. The next part is taking all of these four configuration maps and creating immutable Scala objects from them. So as you can see, that’s what this cell does. And then I’m wrapping those configurations in an array, which is needed for the FamilyRunner. The next step is I’m instantiating the FamilyRunner and passing in the training data set. And I’m also passing the array of configurations that we just saw. And then I’m just calling the execute-with-pipeline API. It’s that simple. Behind the scenes, it’s gonna take all of the configurations set up, create Spark ML pipeline-like semantics, and concurrently run them on your cluster. So in this case, as you can see, it took a while to run. Normally this is the part when I go and get myself a coffee. Depending on what your configuration is and what you’re training, it can take a while. As we will shortly see, it actually ran around 880 models for my four configurations. The next part is we will look at some of the MLflow stuff that AutoML internally logs. So, behind the scenes, AutoML logs every generation’s parameters to MLflow

and then it also logs the best model that it finds in the end. So for that part, you will see that I have two experiments that have been created by AutoML. One is MLflow logs and the other one is MLflow logs best. Let’s take a look at what the XGBoost generational runs look like. So as you can see here, it ran around 219 total runs.

And as you can see here, we have around 12 generations that ran. And there’s a clear generational increase in the metric success. So as you can see, on the left, we have non credit generations, and on the right is an accuracy of around 98%. And similarly, we have it for XGBoost as well. So as you can see here, there’s a clear trend of generational increase in metric success. On the right you can see it’s 97% for XGBoost. Okay, so those were the runs for the generations. Now in the end, when everything is completed, AutoML also logs the best models for us. So we will see that if you click on this experiment. I’m particularly looking at XGBoost. This is the best run for the XGBoost model. And as you can see, the pipeline API also logs some interesting parameters here. So it basically logs every stage, because there is a lot of functionality there on feature engineering that is part of AutoML. There are a lot of custom stages that we have that are not available in standard Spark ML. So we have a lot of standard transformers that are part of the AutoML pipeline. And you can see that the best pipeline model that we are seeing here has some additional configurations that have been involved as well. So as you can see here, each stage has some interesting metadata around it. We can see what the input data set looked like. We can see what the count was. We can see what the schema was and what made these transformations work. It comes in handy when you’re troubleshooting and you are trying to figure out what each stage does. So all of that information can be found in your best run experiment or a particular run of that model. And as you can see here, there are plenty of stages that have been involved, with all their parameters. And it also tells you how much time each stage took.

It tells you sorry about that.

So it tells you the total stages executed are 25. It tells you what the execution time is for each of the pipeline stages. So there’s some really handy instrumentation around pipelines that you can find here, in case you’re interested and you’re troubleshooting something. Next thing is

I’m gonna show you what the return of the FamilyRunner looks like. So as you can see, this run finished and it returned back the pipelines. It looks like it returned a map of string to pipeline model type. So it basically gives you a pipeline model, obviously the best pipeline model for each model type, and also the best MLflow Run ID. So now you can use it to run some predictions. In our case, the first recommended way is to do it by MLflow Run ID. That way you’re not messing with any of the paths, you’re not manually loading any models. So what this does is, as you can see here, in cell 24,

I’m passing in the best MLflow Run ID and I’m giving some basic configuration that is needed to connect to AutoML’s service. So it’s gonna do a lookup on that expressed experiment name. It’s gonna pass in the MLflow Run ID. It’s gonna read the AutoML tags on it. And behind the scenes, it’s gonna find out where the pipeline model has been written to. You can think of it as a basic lookup that loads said pipeline models for you. That’s all you need. In the next step, let’s look at some of the stages of these pipeline models. So as you can see, we have a lot of AutoML-specific custom transformers.

And all of these, these are specifically for our demo, but they conform with the standard Spark ML pipeline implementations. So these are some of the custom stages for this particular configuration that we set. And then all I’m doing here is, as you can see, calling transform on the best pipeline model and passing in a data set to predict, and then it’s just doing a display on it. As you can see here, we have a prediction column. In our case the prediction comes out to be a number ’cause that’s what we put in for training. But if you actually had the actual label, the pipeline APIs will internally convert that into an index and then apply the indexer’s IndexToString back, so that the prediction is in the right label. The next part is running inference using the pipeline model directly. We don’t recommend doing this, but if for some reason you are not using MLflow, you can directly get access to the pipeline model that is returned by the FamilyRunner API. So as you can see here, I’m loading the logistic regression pipeline model and doing a transform on it, that’s what this one does. And we can see here that we have the prediction column.

So that’s basically how you run inference, in one of the two ways. Obviously, you wanna use MLflow, but if not, you’re open to using the pipeline model as well. Next thing is, I quickly want to talk about the feature engineering API. So if for some reason you’re not looking to train and optimize with AutoML’s algorithm, you are free to use AutoML’s advanced feature engineering pipeline, and that’s what this allows you to do. So what I’m doing is I’m defining a basic configuration here that I need for my feature engineering. Again, you can go to the API docs on the open source AutoML repository, and you can find out what additional feature engineering steps you can use, like variance, covariance, cardinality. There’s a bunch of stuff there that I would recommend you take a look at before you start using it. So this is gonna do that for you. So you’re passing in the configuration for running the feature engineering pipeline, and in the next cell, 33, I am passing in that configuration and I am running the generic feature engineering pipeline with the verbose option set to true. So what this is gonna do is, it’s not gonna train or run any genetic algorithm, it’s just gonna build your feature vector that you can then use offline to train and model it.

The verbose flag is used for telling AutoML whether you want to see all the intermediate columns that were created before the final feature vector was emitted. So in this case, as you can see, the features vector is there, and you also have a bunch of additional columns. But if you don’t want that, you can just set it to false, and it won’t do that for you. So again, the return of the feature engineering pipeline is also a standard Spark ML pipeline that you can use without needing anything else. That concludes this demo. I hope you enjoyed it.

– And there you have it. That’s AutoML at 50,000 feet. Yes, there’s a lot more that we didn’t cover. We have nowhere near enough time to cover any of these aspects or do any of it justice. But I encourage you to go out to Databricks Labs, check out the toolkit, download it from Maven (PyPI will be coming soon), and get involved. Help us fix all this stuff and implement it all.

About Daniel Tomes


Daniel Tomes leads the Resident Solutions Architect Practice at Databricks and is responsible for vertical integration, productization and strategic client growth. His big data journey began in 2014 at a major oil and gas company, after which he moved to Cloudera for two years as a Solutions Architect, and in 2017 he joined Databricks.

About Ben Wilson


Ben Wilson is the creator and lead developer of Databricks Labs AutoML. He currently serves as Practice Lead within the Resident Solutions Architects group at Databricks, specializing in Machine Learning Engineering and Data Engineering. Prior to his current role, he was the Data Science architect at Rue Gilt Groupe. His interests are in automation, concurrency, and creating solutions to ease production deployment of ML projects.