In this session, you will learn how to scale your exploratory data analysis and data science workflows with Databricks. You will learn how you can collaborate with team members writing code in different languages (Python, R, Scala, SQL) using Databricks Workspace, explore data with interactive visualizations to discover new insights, and securely share code with co-authoring, commenting, automatic versioning, Git integrations, and role-based access controls. You will learn best practices for managing experiments, projects, and models using MLflow. Attendees will build a pipeline to log and deploy machine learning models to production.
This session will be “follow along” – you are welcome to try running the notebooks yourself ‘live’, but it is not required. They can be re-run later as well. If you want to follow along, download the notebooks from https://files.training.databricks.com/classes/data-science-on-databricks/ . We recommend downloading the version with solutions.
For access to Databricks, sign up for free at https://community.cloud.databricks.com/ . Import the notebooks and provision a cluster using Databricks runtime 7.3 ML.
Joshua Cook: Welcome to the session, Data Science on Databricks. In this session, we will model a data scientist’s work at a fictional health tracking company called Moovio. Moovio sells a wearable health tracker device that collects data for its users to monitor their physical activity. We will create and explore an aggregate sample created from user event data. We will design an MLflow experiment to estimate the bias and variance of several models. We’ll use EDA and estimated model bias and variance to select a family of models for further model development. We have access to daily user health event data, including body mass index, maximum VO2, active heart rate, resting heart rate, and user-provided labels of their lifestyles. Let’s get started. I’m going to be working through this course using Databricks Community Edition. You are encouraged to do the same while watching the webinar. To get started, the first thing you should do is create a new cluster. Here I am at the Databricks homepage. I’m going to click the clusters tab and create a cluster. I’m going to give it a name based on today’s topic and select the latest Machine Learning Runtime version.
Here, we can see the cluster coming online. To follow along, you will need to make sure that you have downloaded the DBC containing the course materials. I have downloaded these materials to my desktop. I’m going to navigate to my home directory, click the dropdown and select import. I’m going to import from a file. I’m going to go back to my desktop and drag the file to the upload area. When it’s ready, I’m going to click import, and that should import the notebooks we’ll need into my workspace. In the next notebook, we will run a few utility functions so that the data we will be working with is available to us in our workspace. In a typical workflow, data will have been made available to you, the data scientist, as tables that can be queried using SQL or PySpark. This notebook mirrors a typical workflow where the data has been made available to you by a data engineer. I’m going to get started with the zero zero getting started notebook. I am here at the homepage of my Databricks workspace. I’m going to navigate to the home directory where I’ve uploaded the DBC.
From there, I’m going to go into the Python directory. Here are all of the notebooks that we are going to be working from in this part of the webinar series. I’m going to start with zero zero getting started. Make sure that it’s attached to the cluster I’m working from. Mine is already attached. Let me go ahead and detach that and show you one more time so that you can see what it would look like if it was not attached. If it was not attached, it would look like this. I would just click on the dropdown and then select the cluster that I’ve chosen for this course. As I note here, it says, before getting started, make sure that you update the notebook include/configuration with your preferred username. I’m actually going to jump over there real quick. Let me go back to workspace and select includes. There it is, include/configuration. I’m going to open up that notebook. In fact, look at that, we have our first TODO. It says TODO username equals fill this in. If I had run this, we would’ve had a problem because fill this in is not defined in this workspace.
I’m going to need to put a string here with my preferred username. I’m going to use the username that I use in most places, which is just my name, first and last, Joshua Cook. That’s going to be just fine. I actually don’t need to execute this notebook. I just need to update it, and then it will be called when we run the other notebooks. I’m going to go back there, back to the workspace, choose zero zero, getting started again. Now it says, before getting started … Yup. I’ve done that already. I’m ready to go. I’m going to kick this off. I’m going to source the include/utilities notebook. It says the include/configuration notebook, which we just updated, defines data paths and configures the database. The next thing that I’m going to do is make the notebook idempotent. That’s what this step is going to do. What this means is that if I run the notebook more than once, it is not going to cause any issues. Basically you can think of this as a way to reset everything if we wanted to run this again. I’m going to run that function right there.
It says false. That’s because I haven’t run this before. If I had run it before and there were some files there, it would say true. The first thing that we’re going to do here is retrieve and load the data that we’re going to be working with. These are two files, one called health profile data and another called user profile data. These are being made available to us as parquet files. I’m going to use a function called process file that’s going to retrieve those files and load them into our workspace. This function takes three arguments: file name, path, and table name. This function is going to do three things. First, it’s going to retrieve a file and load it into our Databricks workspace. Then it’s going to create a Delta table using that file. Then it’s going to register the table on the Metastore so that it can be referenced using SQL or a PySpark table reference. This brings us to our first exercise. We need to retrieve and load the data. We’re going to retrieve the data using the following arguments.
We have a file name here, a path where we’re going to be loading the data, and then a table name that we’re going to use for the table. This may be a bit confusing. It might help if I let you know that these paths have actually been defined for us in the configuration files. We can see that silver daily path is already defined for us, as is dim user path. You can see both of those paths already exist. What we’re going to do is take this function process file that takes three arguments, and we’re going to call it right here in this cell. It’s going to take three arguments. These are the arguments right here. I’m going to be lazy and just copy paste. This one here is a string, so it’s going to need to be a string. The path, as we noted, is already defined. So it’s not a string. Then the table name is in fact a string. This is going to be how we’re going to process the first file. Now we have to do this for both files. Let’s go ahead and get the other arguments here. The same thing.
This should be a string. This is a variable, and this is a string. I’m going to run that function. Here it’s going. As I noted previously, this process here is something that you probably wouldn’t be doing as a data scientist. This would have been done for you by the data engineer. I wouldn’t worry too much about exactly what’s happening here. If you are interested in reading about it, you can, of course check out the include/utilities notebook, where the code that is defining this function is written. But for most data scientists, the data is going to be available to you. You might have to create aggregates or different other kinds of tables, but getting the data to you from an external source is usually handled. It looks like this is wrapping up here. It’s retrieved the file. It’s loaded it into a location. It loaded it into this location as a Delta table. Then it registered that table in the Metastore. The same thing for the user profile data. This brings us to the end of the getting started notebook. In this notebook, we will sample 3% of our user health tracker event data.
We will then use this sample to create aggregates for each user over the sample set of users. This aggregate will be the basis of all of the work we will be doing in the first part of the webinar series. Because the data has been made available to us as Delta tables, this notebook will use Apache Spark to read and write the data. At the end of this notebook, we will write the sample aggregate to a Delta table. Using Delta tables is a Databricks best practice for persisting and sharing data. Here I am back at the home. I’m going to navigate into the project once more, clicking the home button. Here’s that DBC I loaded. Going into Python. Here’s all the notebooks that we’re working from. We’ve already done zero zero getting started. Just a reminder, if you haven’t done it already, first, it’s critical that you set up your configuration file and add your username there. Second, it’s critical that you run that getting started notebook because it’s going to bring in the data. I’m going to jump into create aggregate sample. Here we are. I am going to attach to my cluster.
Let’s go ahead and run that configuration file. It finds the data paths, configures the database. The next thing that we’re going to do is create Spark references to the data that we’re working with. We have these two files that we brought in, these two parquet files that we brought in. Now they’re registered in our system or they’re loaded into our system as Delta tables. One of those tables is user profile data and the other is health profile data. There’s a correspondence between the two. The health profile data is data corresponding to the users in the user profile data. What we’re going to do here is actually create Spark references to that data, using the names that we used when we defined the table. Here I am. I’m going to go spark.read.table. This is the table name right here. My user profile df is going to reference that table. Then the health profile df is going to reference this data here, health profile data. If you are new to Spark, this is something that might be a little bit tricky. Spark data frames are a little bit different from pandas data frames.
A pandas data frame actually loads the data into memory; a pandas data frame exists in memory. A Spark data frame is more a reference to where the data is on disk so that you can retrieve the data when you need it. It’s part of this whole lazy loading thing that Spark does. We’re not going to go too far into it. We’re just going to be defining those references here. Now, another thing to note is that, because we’re working in Databricks, we already have a Spark reference defined. When I’m working locally, when I’m using a different tool, I always have to define my own Spark reference. When I’m using Databricks, I don’t need to do that. It’s just available to us when we start up the notebook. The first thing that we’re going to do here is display the schema of the data that we’re going to be working with. Let’s print the schema here, and luckily the Spark data frame class has a really handy method we can use to do this called print schema.
I’m going to do that here, and I’m going to print both of those schemas. There they are. It looks like our user profile data has some attributes associated with each user and a unique ID. First name, last name, the lifestyle they’ve identified. Remember that we have three of those here. Whether or not they are a woman, country, and their occupation. Then the health profile data is actually event data associated with each user. That ID is going to reference a user in our user table. We’re going to have the date associated with them. Then several attributes recorded on that date for each of those users. That’s the schema of the data that we’re going to be working with. Let’s get a count of the users that we have. It looks like we have 3000 users in that user profile data. All right. Just to get a sense of what we’re working with here, let’s get the minimum and maximum dates for all the data in the health profile data. I’m going to use the minimum and maximum PySpark SQL functions.
I’m going to get the minimum of the date column and maximum of the date column. If you’re wondering why we used dte instead of date, it’s because date is a reserved word in Spark SQL. I didn’t want to have any collisions, so I’m just using dte for date. It looks like here’s the minimum and the maximum. It appears we have an entire year of data. We’re going from January 1st, 2019 to December 31st, 2019. All right. If you recall earlier, there are three lifestyles associated with our users. Let’s just have a look at those. I am selecting the lifestyle column from the user data and using the distinct operator on that to see the distinct values of lifestyle. That’s running. It looks like we do in fact have three. We have the sedentary lifestyle, the weight trainer and the cardio trainer lifestyles. All right. Now this is going to be the critical piece of this webinar. We are not going to be working on all of the data. We’re going to be working on a sample of the data.
Just trying to get a sense of what the data is. It doesn’t make sense to do our exploration on all of the data. Of course, we’re going to want to use all of the data eventually, but for this component, we’re just going to sample it and look at a smaller sample just to get a sense of what we’re looking at. I’m using the sample method from the data frame class. I’m going to be sampling 3% of the data and just to get … I’m going to then group that sample by lifestyle and get counts to see how many people from each lifestyle we’re talking about. It looks like I have 25 people who identify themselves as sedentary, 35 who identify as weight trainers, and 29 who identify as cardio trainers. The next thing that I’m going to do is join the two datasets that I have together. Now we also have an assertion statement here when we’re done. Because we have a year of data for each user, we would expect that the full health profile would have 365 times as much data as the user profile. This is because for each user, there should be 365 rows in that health profile sample.
Let’s go ahead and do that join. The two data frames that we’re looking at, I’m going to just scroll back up and grab those names. We have this one and this one. I’m going to copy those, come back here. We’re going to do a join. The way we’re going to do this, is we’re going to start with … We want the sample, don’t we? We want the user profile sample. We’re going to join the user profile sample to the entire health profile data. The health profile data is daily data for every single one of our users. We’re going to join just the sample to that data frame in order to get the daily data for our sample. I’m going to join the user profile sample to the full health profile data. I’m going to use the column ID to do that. Here we go. It looks like everything went well. If you’re not familiar with this assertion, the way that it works is, if that had been false it would have actually raised an assertion error.
Let’s have a look at this health profile sample that we generated. This is the sample data frame that we just generated. You can see that we’ve got different users here. So far we’ve got Sharon Selenas, Caitlin James, Benjamin Molina, William Coleman’s. There’s a lot of different users in here, different lifestyles. We have gender, we have country, their occupation. Then these are recorded here for each date. Then it has the resting heart rate for that day, their active heart rate for that day, BMI, VO2 max, how many minutes they worked out on that day. This is the data that we’re going to be looking at. But really what we’re after here is a profile of a user. We want to be able to take all of this daily event data that we’re taking in and use that to create a profile of a user that we could then use to classify a user.
The premise here is, let’s say if we knew the average BMI of a user, the average active heart rate, the average resting heart rate, the average VO2 max, if we knew those values for one of our users, we could predict what lifestyle that they would have. Therefore, recommend different kinds of workouts. “Hey, if you have a sedentary lifestyle, maybe try out this beginner level workout.” Just to get a good workout habit going. Or maybe if they’re a cardio trainer, “Hey, you should try doing these sprints next week.” The idea being that if we can take this aggregate data to create a profile of users, we can use that to actually drive their engagement with the product. Let’s go ahead and build these profiles. We’re going to do that by performing aggregations over the sample table that we’ve created. These are the aggregations that we’re going to perform. We’re going to do a mean of these four features. We’re going to rename those features when we do that to have these names. Let’s go ahead and get started.
I am going to be super lazy and do this. I’m going to show you a fancy feature that is available in the Databricks workspace, which is the ability to use multi cursors. You can see that I am selecting right there. I’m holding down command, and I’m going to select each of the places that I want to do this. Now I’m just going to type on each line at the same time, which is pretty fancy. I’m turning those into a string and then close the parentheses. I’m creating a mean on each of those, alias as these names. Let’s do this. There we go. I’m going to put a comma after each one of those. Let’s go ahead and run that. We can click on this little dropdown here to have a look at the schema of the data frame we’ve just created. I probably did that because if you’re not used to the multi cursor thing, I probably did that a lot faster than some of you might be typing at home. Let’s just hang out here for just a moment.
This is actually going to be critical because we are going to be testing that schema in just a moment. We’re going to need to make sure that we have the correct schema, including the columns renamed. You are going to need to run these aliases on each of the columns to rename them using these names. I’m going to go ahead while we’re … I’m going to still give you a minute to be typing that in. I’m going to go ahead and have a look at this data we’ve just created here. I inserted a cell underneath. This is not going to be required. This is just a thing I’m doing real quick. Let’s collapse that. Let’s go ahead and have a look at this data we just created. I’m just running … I’m using the built-in display function that is available as part of Databricks to display the data. You can see here’s all this data that we’ve created. I’ve only included these four numerical features. I do have the lifestyle associated with each person. Then I don’t have anything beyond just the ID any longer. I’ve discarded everything but the ID.
This is this aggregate data. This is the data we’re going to be using throughout this webinar. This is super critical that we get this right. Because it’s so important to get it right, I am actually going to run this test on the schema of the data frame that we just created. I’m asserting that this data frame that we just created has this schema right here. Let’s run that. It passed. Again, if it had failed, it would raise an error. If it just goes and doesn’t do anything, that means that the assertion passed. All right. Now, the last step that we’re going to do is to persist this data so that it’s going to be available in other notebooks as we’re working through this webinar. The way I’m going to do that is by writing the data to a Delta table. I’m going to take that data frame, and I’m going to use the write and using format Delta with mode overwrite. Then the only thing I need to do here is tell it where. I’m going to do a .save and give it this name right here.
This is where we’re going to save that data. If you’re wondering where that goldpath comes from, that comes from the configuration file that we set up at the beginning of the webinar. While that’s running, we can go ahead and just pop that in here if you want to have a look at it. It’s just a string and it defines a reference to a location in DBFS, the Databricks File System. That is writing right now, and you can see that’s where the data was saved. If you really want to even do something fancier, we can even have a look at it. We can do display dbutils.fs for file system .ls for lists. I’m going to do a list on the file system, and I’m going to list everything at the goldpath and in this location. This should list all of those files there. These are the files that I have written. This is the Delta table. I’m not going to go too far into Delta tables, but basically it means that we’ve written to this location using parquet and we have a Delta log. This is a best practice for persisting data beyond a specific notebook.
In this notebook, we will do exploratory data analysis on the aggregate sample. This EDA will be done primarily through SciPy-based visualization tools available as part of the Databricks Machine Learning Runtime. To use these SciPy-based tools, we will need to convert from a Spark data frame to a pandas data frame. Luckily, there’s a method, toPandas, that does just this. We’re just trucking along here. I am going to go back to the workspace. Just a little note here, I generally, when I’m working in a project, I try not to click home. If I click home, I’m going to go back to my home. But workspace is going to take me right back to where I’ve been working. I mean, it’s a few extra clicks, but I am lazy, so I don’t like to make those extra clicks. I’m going to open up the explore aggregate data notebook here. Again, I am going to attach to the cluster where I have been working. Run that configuration again. If you haven’t done that yet, you’re not going to be able to work through this.
But I doubt that you would have made it this far without having updated the configuration file. But just FYI. All right. In the last notebook, we wrote this sample aggregate data that we created. We wrote it to a Delta table, and now we’re going to actually load it back into this notebook. To do that, we have to do two things. One, we have to use spark.read to read that Delta table. The other is, we’re going to use the toPandas method to convert that Spark data frame to a pandas data frame. Notice I say the word load here, because remember a Spark data frame is a reference to the data, but a pandas data frame is actually data in memory. Let’s see. Let’s do .load and provide it with this location right here. You know what? While we’re doing that, it’s worth thinking about when we would want to use Spark and when we would want to use pandas. Pandas is perfectly fine to use on this small sample, this 3% sample of the data that we’re working with. We would probably not want to use pandas on all of the data.
Pandas is going to be single node data work. Whereas Spark is designed for distributed processing. Spark can handle all the data. If it was going too slow, we can just add more machines to our cluster to speed up the process. This is part of … That’s not to say that pandas isn’t useful. Pandas is very useful. It’s just, it’s going to be appropriate for smaller data. Which is one of the reasons why we’re working on a sample. Let’s go ahead and load that data frame in. It’s loading in right now. We have that loaded. Next we’re going to load the different SciPy libraries that we’re going to be using. These are the classics. The classic SciPy libraries that no doubt you are all familiar with, matplotlib, numpy, pandas, seaborn, all the good stuff. You may not be as familiar with seaborn, but hopefully you will get an appreciation for it from this notebook because we’re going to use seaborn a little bit here. The first thing, let’s have a look at the unique lifestyles.
Those are actually going to be available as the lifestyle column in the pandas data frame. One thing I also want to point out here, when we did this with Spark, we used the distinct operator. With pandas, we’re going to use the unique operator. It does the same thing, just a slight difference there which is worth pointing out. There are the three lifestyles that we’re going to be working with. The next thing we want to do is actually split the data we’re working with into features and targets. The features are going to be all of the numerical columns in the data frame. Target is going to be this column, lifestyle. If I were to do something like this, put this pandas data frame here and type dtypes, that actually is going to give us the columns in the data frame with the types associated with each column. When I go here, health tracker sample aggregate pandas data frame, select dtypes and exclude the objects, it’s going to select all of the columns excluding those that have type object.
This is going to give me just the numerical columns and that’ll be my features. Then the target is going to be another data frame that’s just the lifestyle column. I’m going to make a copy of that for pandas-specific reasons that I’m not going to go into right now. Here we are. We’re going to generate our first visualization. To do EDA on the features, we want to use seaborn to display a pair plot of our features. This is actually pretty simple, actually very simple. Which is one of the reasons why I like seaborn. Remember I told you that I think, once you’re familiar with it, you’re going to like it. Well, this is why. This is a pretty straightforward command and the plot that it’s going to generate for us is a pretty nice plot. It’s great to have this available behind such a simple API, which is the advantage of working with seaborn. The diagonal here is actually a distribution plot of each of the features.
We’re going to look at those in more detail using actual distribution plots in just a moment. But this gives us a sense of just how the features are distributed against each other. I mean, this is really clear right here. We can see that there is a very strong linear relationship between mean resting heart rate and mean VO2 max. We can see that. Then obviously everything above the diagonal is a mirror of everything below the diagonal. We really only need to look at one set of data. It looks like mean active heart rate and mean resting heart rate also have a linear relationship. I should probably point out that there is a negative linear relationship between mean VO2 max and mean resting heart rate. A high VO2 max would correspond to a low resting heart rate, which would make sense. VO2 refers to how much oxygen you have available to you while you’re exercising. If you have a lot of oxygen available to you while you’re exercising, you’re probably someone who has a low resting heart rate. It makes sense that they have a negative linear relationship.
Then a positive linear relationship between active heart rate and resting heart rate. That also makes sense. BMI is a little bit all over the place. That’s something that could present a challenge to us. If we look at BMI versus each of the other three features, it’s all over the place. Maybe if you look at this distributed by lifestyle, maybe we get some more insight there. But just looking at it at this level, BMI, it could be a tricky feature. Next we’re going to use seaborn again to display a distribution plot for each feature. I’ve written a little bit of boilerplate code for you here because this one is going to be a little bit tricky. There’s not much to do here. You just need to use distplot in order to get that distribution plot here. Let’s go ahead and run that. I said we were going to look at the distributions in more detail. Well, here we are. We’re using the seaborn distplot API to generate these distribution plots. Especially mean BMI is very normal, active heart rate, pretty normal.
We got a little bump there. Resting heart rate, some interesting behavior here. I got to tell you, this is good to look at. But what I’m really interested in is I want to look at these distribution plots, but disaggregated by the different lifestyles that we’re looking at. Let’s go ahead and run the same code here. But now I want to look at it against each lifestyle. We’re going to use matplotlib to generate a series of subplots. We want four of them. Then we’re going to actually enumerate over each of the features. Essentially, enumerate is a Python function that is going to give us each of the features one at a time in a for loop and also give us an index associated with each one of those. We’re actually going to use that index to tell matplotlib which subplot we want each feature to go on. Ax is going to reference each of our subplots. Then the first one goes in the first subplot, the second one the second, and so on and so forth. Then we’re going to iterate over each of the lifestyles.
If you recall earlier, we actually defined this list lifestyles to be the unique lifestyles, the three lifestyles, weight trainer, cardio trainer, and sedentary. We’re going to iterate over each of those lifestyles. What we’re going to do, is we’re going to take that target data frame and this lifestyle column there. You know what? Let me be consistent and use only this … You may know pandas has two different ways to reference data. We’re going to be consistent and use the same one. We’re going to look at that target lifestyle column. Where that target lifestyle column is equal to the current value of lifestyle as we’re going through the loop, I want the features associated with that person. This is going to give us a subset of the data for each iteration through the loop. It’ll give us just the weightlifters, just the cardio trainers, just the sedentary. Then we’re going to use seaborn again. We’re going to generate a distribution plot, which is distplot, is the keyword there.
We’re going to generate a distribution plot where we look at the subset, but just the feature we’re interested in. Put it on the subplot where we want it to go and then label it with that word lifestyle. All right. Let’s go ahead and run that. Here we go. This is what I’m looking for. This is really helping me out. Because now I can tell that we do actually have significant differences between each of the three lifestyles against each of these features. I mean, look at this. Resting heart rate, not only is it significantly higher for those who have a sedentary lifestyle, there’s also a much wider range of resting heart rates for those with sedentary lifestyle. The cardio trainers have a marked difference between them. I mean, think about it, even if you took off sedentary here and just compare weight trainers to cardio trainers, the cardio trainers would have a significantly lower resting heart rate and a significantly higher VO2 max.
That’s actually going to help us. If we’re able to look at these, resting heart rate and VO2 max, that might actually help us to pull out those cardio trainers when we build this model. It’s a pretty … Another interesting thing here is that we did notice that there was a linear relationship between active heart rate and resting heart rate. Well, notice there’s a similar pattern between the three different weight groups. I’m sorry, lifestyle groups. They have a very similar pattern. The same thing with VO2 max, but flipped. Which makes sense because of the negative relationship between the resting heart rate and the VO2 max. Then BMI, BMI remains tricky. It does actually have the same sort of pattern that we’ve been looking at, but it’s [inaudible] cluster. Or I may not use the word cluster; consider the overlap between the different distributions. Especially between the weight trainers and the sedentary folks. If you think of this, it makes sense, because weight trainers are actually trying to put on mass frequently. They would have higher BMIs closer to those who have a more sedentary lifestyle.
The last thing that … I’m gathering information from this. I’ve got this idea that I’ve got these linear relationships here between the different features. Linear between resting and VO2, linear between resting and active, linear between active and VO2. I’ve got these strong linear relationships between the different features, and then they exhibit similar patterns in terms of when I disaggregate the distributions. One thing I might be interested in there is exploring a correlation plot for those features. Let’s not say things like, they look the same. Let’s actually get some numerical measurements for the different features and how they correlate. This is a tricky little block of code. I’ve had this one for a little while. I picked it up … I probably read this on Stack Overflow several years ago and just have dragged it around with me everywhere I go. But it’s a nice visualization for correlation between the different features. I am going to generate a heat map, a seaborn heat map using the correlation generated using features.corr, but masked with this zeros mask that I’m generating.
What that’s going to do is actually remove the mirrored nature of the correlation heat map. I really am only interested in … I’m not even interested in the diagonal. The diagonal is going to be a one-to-one and then above the diagonal will just be a flip of everything below. I really just want to see this. Indeed look at this, very strong. Very strong negative correlation between resting heart rate and VO2 max. Also very strong negative correlation between active heart rate and VO2 max. Then strong positive between active and resting. What this is telling me is that I may not need all of these features in order to do my classification. One thing that I am, in general, going to want to prioritize, especially thinking about this bias variance tradeoff, is the simplest possible model that has the best bias. That’s sort of this tradeoff. One way to think about making that simple model is, if I can identify these strong correlations, maybe I don’t need all of these features to build my model. We’ll get to that later.
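The masking trick described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's own code: the column names and the synthetic values are assumptions made up for the example, and the plotting call is left as a comment so the sketch only needs NumPy and pandas.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the four numeric features (made-up values,
# hypothetical column names).
rng = np.random.default_rng(0)
resting = rng.normal(65, 8, size=200)
features = pd.DataFrame({
    "mean_resting_heartrate": resting,
    "mean_active_heartrate": 1.6 * resting + rng.normal(0, 4, size=200),
    "mean_VO2_max": 80 - 0.6 * resting + rng.normal(0, 2, size=200),
    "mean_BMI": rng.normal(24, 3, size=200),
})

corr = features.corr()

# Start from an all-False "zeros" mask, then flag the diagonal and
# everything above it.
mask = np.zeros(corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Cells where the mask is True are hidden (NaN); only the strict lower
# triangle of the correlation matrix is left.
lower_triangle = corr.mask(mask)
print(lower_triangle.round(2))

# With seaborn installed, the same mask can be handed to the heat map:
#   import seaborn as sns
#   sns.heatmap(corr, mask=mask, annot=True)
```

Because the correlation matrix is symmetric with ones on the diagonal, hiding the upper triangle and the diagonal loses no information.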
The last thing I want to do when I’m doing this EDA is actually, it’s hard to visualize data that’s in more than three dimensions. In fact, let’s even take that a step further. It’s hard to visualize data that’s in more than two dimensions. Maybe it’s impossible to visualize data that’s more than three. I remember I had a physics teacher in school that he would say, “How do you plot something that’s in more than three dimensions?” Well, you plot it in three dimensions and then you write r4 above it. We’re not going to do that. What we’re going to do instead is, we’re going to create a two-dimensional projection of the data that we have using this tool called t-SNE. I am not going into the math behind how t-SNE works. If you’re interested, there’s a link here that you can go read about. Essentially what we’re doing is we’re going to use it for dimensionality reduction in order to project our four-dimensional feature space into a two-dimensional space. Then we can plot that.
We are going to instantiate the t-SNE class from sklearn manifold. We’re seeking two components because we want a two-dimensional projection. We do a fit transform on the features which gives us features in two dimensions. We are going to create a pandas data frame from that. Now we are going to plot those features in two dimensions labeled by lifestyle. I’m creating a tuple here of the three colors I’m working with, blue, orange, green, and a single matplotlib plot. Then for each color and lifestyle in this zipped list I’m creating, I’m matching up … If you’re not familiar with what this zip does, I will show it to you real fast. It’s just going to zip those up. I’ve got to run this first. That’s going to, here we go, zip those up to associate a color with each of the lifestyles. I’m going to run that, and then I’m going to filter the data frame much like I did before, to give me just the rows for each lifestyle associated with the color. Let’s go ahead and plot that. Here we go. What is this telling me here?
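A rough sketch of the projection and the color/lifestyle pairing described above is given here. The data is synthetic, the lifestyle label strings and the `perplexity` value are assumptions chosen for a small toy sample, and the matplotlib call is only indicated in a comment.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
lifestyles = ["sedentary", "weight trainer", "cardio trainer"]  # hypothetical labels

# 20 made-up users per lifestyle, four features each, shifted per group.
features = np.vstack(
    [rng.normal(loc=10 * i, scale=1.0, size=(20, 4)) for i in range(3)]
)
target = np.repeat(lifestyles, 20)

# Two components because we want a two-dimensional projection.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
features_2d = pd.DataFrame(tsne.fit_transform(features), columns=["x", "y"])
features_2d["lifestyle"] = target

# zip pairs the i-th color with the i-th lifestyle.
colors = ("blue", "orange", "green")
pairs = list(zip(colors, lifestyles))
print(pairs)

for color, lifestyle in pairs:
    # Keep just the rows for this lifestyle.
    subset = features_2d[features_2d["lifestyle"] == lifestyle]
    print(color, lifestyle, subset.shape)
    # With matplotlib: ax.scatter(subset["x"], subset["y"], c=color, label=lifestyle)
```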
What I’m seeing here is, while we do have some bleed between the different groups, there’s pretty clear separation between the three groups. What this is telling me is that I am probably going to be able to use a linear model to classify users using this data. We are going to have clear linear separation boundaries between the data. That is it for this notebook.
In this notebook, we will test a simple linear classification model on the aggregate sample. We will also use a helper function to visualize the decision boundary generated by the linear model. All right. Let’s get into basic classification. At this point, I have taken the importing of the SciPy libraries and actually some other nice utility functions, and I’ve put it into a file called includes/preprocessing. I’m not going to be showing that in the webinar, but of course it’s included. I would encourage you to have a look at it and see what’s going on in there. We are going to make use of this scatter plot with decision boundary function in this notebook. All right, let’s get into it.
Again, just like as we did in the previous, we are going to create our features and our target. Actually, the one thing I should point out is that this preprocessing step is also going to load the data from that Delta table and convert it into a pandas data frame. You’re not going to have to do that any longer, it’s just being done for you. We’re going to create the feature and target objects. One thing we’re going to do here is numerically encode the target. We’re going to be working with a library called scikit-learn to do a lot of the machine learning work that we’re doing here. Actually all of the machine learning work that we’re doing in this particular part of the webinar series. Scikit-learn does not like a string encoded target. It’s going to need a numerically encoded target. They can still do classification on that, but it’s going to need everything passed to it as numbers. In order to do that, we’re actually going to label encode the target vector. As of right now, the target looks like this. It is strings and we’re going to want it to be numbers.
We’re going to do that right here with this preprocessing.LabelEncoder. We’re going to instantiate a label encoder and then we’re going to use it to fit and transform the target. We are going to put the target lifestyle in here. Now, if I look at the target, you should see that I have both of them, both columns there. I’ve got each lifestyle and then its numerical encoding right next to it. First, we are going to fit a series of linear models here using logistic regression. Which is the linear model that you use for classification. As I mentioned previously, we looked at this t-SNE data and we saw that there was pretty clear … Or I should say the t-SNE projection of the data, and there were clear boundaries between the different classes. Let’s go ahead and just pass that to a linear model first and see how well it does. I’m just running that t-SNE operation one more time. The next thing I’m going to do is split the two-dimensional data, the t-SNE projection, into training and testing sets.
This is just a best practice, and I’m certain that most of you have probably heard of this before. It’s a best practice when you’re doing machine learning, you should train your model on one set and test it on another. You’re testing the model on data that it has not seen in order to assess the quality of the model on data it hasn’t seen. A standard best practice. We’re doing it here on the t-SNE data. Then we’re going to do exactly that. We’re going to train the model, we’re going to fit it on this training data, and then score it on this testing data. Let’s see how well it does. I’m passing penalty equals none here to just do a straight logistic regression with no regularization. You can see it does extremely well. It gets almost a 96% accuracy predicting the classes using this t-SNE data. The next thing we’re going to do is take the results of this fit. We’ve got this logistic regression model that we fit. Take the two-dimensional features, take the target, and we’re going to pass it to this helper function scatter plot with decision boundary.
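As a small illustration of the label encoding step described above, here is a sketch using scikit-learn's `LabelEncoder`. The lifestyle strings and the `lifestyle_encoded` column name are assumptions for the example, not values taken from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical string-encoded target.
target = pd.DataFrame(
    {"lifestyle": ["sedentary", "cardio trainer", "weight trainer", "sedentary"]}
)

# Fit the encoder on the strings and transform them to integers in one step.
le = LabelEncoder()
target["lifestyle_encoded"] = le.fit_transform(target["lifestyle"])

# Each lifestyle now sits next to its numerical encoding.
print(target)

# classes_ holds the original labels; the position in this array is the code.
print(list(le.classes_))

# The mapping can be reversed when readable labels are needed again.
print(list(le.inverse_transform([0, 1, 2])))
```

The encoder assigns codes by sorted label order, so the integers carry no meaning beyond identifying the class.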
Just as we did in the previous notebook, where we plotted the t-SNE data, the two-dimensional data labeled by lifestyle, we’re actually going to do that. But now we’re adding an additional wrinkle here, which is, we’re actually plotting the decision boundaries generated by this logistic regression model. You can see that as we suspected, there are clear decision boundaries between each of the classes and this logistic regression does very well classifying on this t-SNE data. Our work is done. All right, everybody let’s go home. Obviously I’m kidding. The t-SNE data works very well on this very small sample of the data. It is probably going to be, if we wanted to let it run for weeks on end and engineer a very specific approach to generating the model, we could fit it on all of the data. But that is probably not going to be feasible. In general, I think with a little bit more work here, we can develop a model that is simpler than t-SNE that is going to work on all of the data. Well, what we’re going to do next is, let’s start with all of the sample data.
What we’re going to do is, we’re going to try and fit the same model, see how well it does on all of the sample data. I’m going to do a train test split on this. Let me go ahead and just put that in here. Again, using my super fancy multi cursor techniques. I might be moving a little bit quicker but I’ll hang out for just a moment. Then we’re going to fit a logistic regression model to the data, and there’s a standard pattern that we’re going to use when we’re doing this. We’re going to fit the model on the training data. We’re going to score the model on the testing data. Fit on the training, score on the test. Fit on train, score on test. Let’s go ahead and do that. We’re going to do lr, which is the name we’ve given our instantiated logistic regression. lr.fit features … or rather, features train and target train. Then lr.score features test and target test. There we go. Let’s run that. We have an error. Also, our accuracy is significantly lower. What we’re seeing here is that we have a convergence warning.
Lbfgs is the internal solver that this logistic regression is using to do the fit, and it failed to converge. Which is why our accuracy is much lower when we used all of the data. Let’s do this. Let’s update the logistic regression model we’re using to increase the number of iterations that it’s allowed to do. We’re going to increase this to 10,000. You could also do 1e4 if you wanted to be fancy. Same pattern, fit on the train, score on the test. If you want to be lazy, you can just do that. There we go. We are back to almost a 96% accuracy, but it required more iterations to get there. What this is telling me is that we’re going to do well with linear models on this data. But in the subsequent notebooks, we will take a look at that just a little bit.
In this notebook, we will discuss the bias variance tradeoff, design an experiment to measure bias and variance using the bootstrap, and run it against combinations of features using MLflow.
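The fit-on-train, score-on-test pattern with a raised iteration limit might look like the sketch below. The data is synthetic and the cluster centers are invented, so the accuracy it prints says nothing about the webinar's results. One deliberate difference: the `penalty` argument is left at its default here, because the accepted spelling of "no penalty" differs between scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up data: three classes, four features, 100 rows per class.
centers = np.array(
    [[60, 110, 22, 45], [70, 125, 26, 38], [85, 140, 27, 30]], dtype=float
)
features = np.vstack([rng.normal(c, 2.0, size=(100, 4)) for c in centers])
target = np.repeat([0, 1, 2], 100)

features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=0, stratify=target
)

# max_iter raises the cap on solver iterations (the default is 100),
# which is the usual response to a ConvergenceWarning.
lr = LogisticRegression(max_iter=10_000)

# Fit on train, score on test.
lr.fit(features_train, target_train)
accuracy = lr.score(features_test, target_test)
print(accuracy)
```

Scaling the features before fitting is another common way to help the solver converge; it is not part of the workflow described in the session.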
When we talk about the bias variance tradeoff, what we’re interested in is the uncertainty that’s associated with a particular classification model. When measuring this uncertainty, we typically think about the bias associated with a model, how well it performs the classification, and the variance of that model, how much the model will differ if we use different training data to fit it. We typically think of the tradeoff in terms of model complexity. A model that has two features, say BMI and resting heart rate, is likely to perform classification better than one that just uses resting heart rate. But such a model is more complex and likely to have greater variance with different training data. You can imagine a simpler model, with just one feature, is going to have a higher bias. A model that has more features is going to have lower bias. It’s going to be better at capturing the underlying phenomena, but will have greater variance with new data that’s being fed in. An optimal model will simultaneously minimize both. This is the trick. This is the hard part of the work that we’re doing.
In this notebook, we’re going to examine many different models for predicting our target. Each of the models that we’re going to look at is going to use a different subset of the features. We have four features, mean resting heart rate, mean active heart rate, mean BMI and mean VO2 max. We’re going to try out models that use one feature, models that use two features, models that use three, and models that use four and all of the different combinations therein. It’s going to be a total of 15 models. We’ll use the estimated bias and variance of each of these models to assess which model or models are likely to be the optimal model. We’ll also consider the complexity of each model relative to this estimated bias variance. How are we going to do this? The bootstrap is a method for estimating uncertainty. Here, what we want to estimate is bias and variance. The method involves generating a series of subsample sets, sampling with replacement from the original data set. We’ll then fit a particular model under examination against each of the bootstrap subsample sets.
The accuracy mean across all the models fit to each subsample will be used to estimate the bias and the accuracy standard deviation will be used to estimate the variance. Let’s get into the notebook. Attach to the cluster, run our configuration and preprocessing notebook. Here’s where we’re going to generate our bootstrap samples. This is a function that we’ve written. It is going to generate samples evenly across each of the three lifestyles. Here’s what a sample looks like when generated. We’ve got five instances from each of the three lifestyles. Here, we’re going to actually use some Python to do this. We’re just using a for loop to generate that set. But you know what? Really, what we should be doing is a list comprehension in this case. I’m leaving that for you all as a challenge. I’m just going to grab this function right here. All we’re doing here is just doing it in a list comprehension instead of using a for loop. This reads a little bit better, so the preferred method. Just to verify that everything is working correctly, let’s display the number of samples in each subsample set.
We’re going to use the length built in function on each sample set in our subsample sets. It looks like everything is good. We’ve got 10 sets of size 15. Let’s have a look at the second one.
There we go. Same there. They’re looking good. Looks like we’ve got what we’re looking for. Recall as previously we’re going to need to label encode these. If you notice, they do have the lifestyle encoded as a string. We’re actually going to need to label encode each of those. We’re going to fit the label encoder on the original target, the health tracker aggregate sample, the lifestyle column. Let’s do this. We are going to encode each subsample data frame one at a time using the for loop. We’ll just do an le transform on the sample set lifestyle column. That is done. Now we’re going to design this experiment that we’re going to be conducting. This is going to consist of several steps. First, we need to build the subsets of features. We’re going to do just one to start with. The first one we’re going to do is a single model, a model consisting of a single feature, mean active heart rate. It’s a one feature model, it’s just using mean active heart rate. We’re going to build our experimental data subsets by passing in just that list of features to the sample set.
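A minimal sketch of the bootstrap step: draw five rows per lifestyle with replacement, then build ten such sets with a list comprehension. The function body, its parameters, and the toy data frame are assumptions for illustration; the notebook's own helper may differ.

```python
import pandas as pd

# Toy stand-in for the aggregate sample (made-up values).
sample_df = pd.DataFrame({
    "lifestyle": ["sedentary"] * 30 + ["weight trainer"] * 30 + ["cardio trainer"] * 30,
    "mean_resting_heartrate": [float(v) for v in range(90)],
})


def generate_bootstrap_sample(df, n_per_class=5, seed=None):
    """Hypothetical helper: sample n_per_class rows from each lifestyle,
    with replacement, and stack the pieces into one data frame."""
    return pd.concat(
        [
            group.sample(n=n_per_class, replace=True, random_state=seed)
            for _, group in df.groupby("lifestyle")
        ],
        ignore_index=True,
    )


# A list comprehension instead of a for loop that appends to a list.
subsample_sets = [generate_bootstrap_sample(sample_df, seed=i) for i in range(10)]

# Verify the size of each subsample set: 10 sets of 15 rows.
print([len(sample_set) for sample_set in subsample_sets])
```

Sampling within each lifestyle group keeps every subsample balanced, which matters later when cross-validation splits each set by class.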
The sample sets at this point in the experimental data subsets are going to be data frames that have a single column, this mean active heart rate column. Then the targets are going to be just the lifestyle encoded column. Let’s go ahead and run that. If I display that, let’s have a look at just one of those, just so we can get a sense of what we’re working with here. We’ll look at the second one again. You can see it’s a data frame with just a single column. These are for one run of the experiment, we have a single column. Then we are going to instantiate this grid search model. We’ve got this grid search CV. We’re passing in a decision tree classifier. I actually mentioned we were going to look at some other families. Here, we’re actually looking at the decision tree family of models. Different from the linear model that we’ve been looking at, but don’t worry our linear friend will come back. This grid search model, we’re actually using it here just for its cross validation purposes.
We’re passing in an empty grid for the grid search parameters. We’re only conducting a single run. What we’re going to be doing here is, the cross validation is going to run five times. Each time, it’s going to leave out one for each lifestyle and then use that as the test. We’re doing not quite a leave one out validation, because we’re leaving one out for each lifestyle on each split. We need to fill this in. We need to fit the grid search model. The way we’re going to do this, is we’re going to fit the model on the features and the target. Let’s go ahead and run that. You can see this is going to go through, this is going to grab each of the experimental data subsets. There are 10 of them and the associated targets. It’s going to fit each of them using this leave one out decision tree fitting process. The score is already computed as part of this cross validation process. We’re going to get a mean test score for each of the 10 runs.
We’re appending that to this experimental scores list. I mean, here, let’s take a look at it before we move on. We’re going to wind up with this experimental scores result. These are the results of each of those 10 runs. We fit a decision tree model for each of those 10 experimental data sets using this grid search cross validation process. Here, let’s go ahead and have a look at … This is going to show us the results from just the last iteration. You can see we have quite a few results we can look at. It did five splits. These are the results on each of those splits. Then we take the mean, and that is the score that we’re recording. Let’s display the results. We used just a single feature in the feature subset this time, and the mean test score across all of our 10 bootstrap samples was a 0.69. The standard deviation was a 0.12. The mean accuracy can be used as a stand-in for bias, and the standard deviation can be used as a stand-in for variance.
What we’re going to do with this, is we’re going to throw it into an MLflow experiment runner. I have defined an experiment runner here. Each of these steps we’ve described above, I’m putting into this experiment runner. We’ve got a helper function. This is the helper function to run the experiment. First, we’re going to build the subsets of features. That’s happening right here. It’s going to take in feature subset as an argument and pass that in to generate these sample sets. We’re going to fit on each subset using the cross validation. Here’s that same function we ran. Here, we’re doing it with logistic regression instead of the decision tree. We’re going to do a fit on the features generated in these experimental data subsets. Then finally, we’re going to record the results. We’re going to record the feature subset, the list of features that were passed as a subset, and then the mean score and the standard deviation score. This is this function that we’ve written, this experiment runner. Let’s go ahead and run that.
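The cross-validation-only use of grid search described above can be sketched like this. The ten sample sets are generated synthetically (five made-up rows per class, one feature), so the printed mean and standard deviation are not the webinar's 0.69 and 0.12; the variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)


def make_sample_set():
    """Stand-in for one bootstrap subsample: 5 rows per class, one feature."""
    X = np.concatenate(
        [rng.normal(m, 8.0, size=5) for m in (110, 125, 140)]
    ).reshape(-1, 1)
    y = np.repeat([0, 1, 2], 5)
    return X, y


experimental_scores = []
for _ in range(10):
    X, y = make_sample_set()
    # An empty parameter grid means a single candidate, so the grid search
    # is used only for its cross validation. For a classifier, cv=5 is
    # stratified: each test fold here holds one row per class.
    gs = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid={}, cv=5)
    gs.fit(X, y)
    experimental_scores.append(gs.cv_results_["mean_test_score"][0])

# Mean accuracy as a stand-in for bias, standard deviation for variance.
mean_score = np.mean(experimental_scores)
std_score = np.std(experimental_scores)
print(round(mean_score, 3), round(std_score, 3))
```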
It looks like it’s done. I don’t know if you’ve noticed this, but over here, now we have a little green dot next to this little beaker bottle right here. This is the result of our MLflow experiment run. You can see the subset was mean active heart rate. We wound up with a mean score of 0.733, and a standard deviation of 0.079. We’re moving along. Now we want to be able to generate all of the feature subsets. My feature columns are going to be available from the original sample aggregate pandas data frame. I’m excluding again, using this exclude object. This is just going to give me the numerical columns. I’m going to use itertools and the combinations function to generate all of the possible combinations of features. Here you go. These are all of the possible feature subsets. You can see I’ve got subsets consisting of a single feature, I’ve got subsets consisting of two features, subsets consisting of three features, and then a single subset consisting of all of the features. All right. We’re going to run these as MLflow experiments using each feature subset. Here we go.
Okay. We are back. That command took almost 30 seconds. Which, considering we fit 15 models, isn’t too bad. Let’s go ahead and we can refresh this up here to see these results. This is not a terrible way to look at the results, but I actually prefer to use the MLflow API to access the results. Over here, I’m going to use mlflow.search_runs to grab the results. You’ll notice that it gives me the results as a pandas data frame. Which is great because we’ve been using pandas all along. We’ve become hopefully comfortable with working with pandas. What we’re going to do is prepare the results data. The columns that I’m interested in are the metrics mean score, metrics standard deviation score, and the params subset. Depending on the situation, we may wind up with an experiment or two with no values. We’re going to remove those. If you’ve run this more than once, we want to drop the duplicates. These are the results that we get. You can see we’ve got all of our different parameter subsets.
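Generating every feature subset with `itertools.combinations` can be sketched as below. The four column names are hypothetical spellings of the features named in the session.

```python
from itertools import combinations

# Hypothetical column names for the four numeric features.
feature_columns = [
    "mean_active_heartrate",
    "mean_BMI",
    "mean_resting_heartrate",
    "mean_VO2_max",
]

# All subsets of size 1, 2, 3 and 4.
feature_subsets = [
    list(subset)
    for k in range(1, len(feature_columns) + 1)
    for subset in combinations(feature_columns, k)
]

for subset in feature_subsets:
    print(subset)

# 4 + 6 + 4 + 1 subsets, one model per subset.
print(len(feature_subsets))
```

With four features there are 2^4 - 1 = 15 non-empty subsets, which matches the 15 models fit in the experiment.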
Each one of these is a different model. Each one of these corresponds to a different model. Then these are the scores associated with those models and the standard deviations associated with those models. The last thing I want to do is, I also remember we were interested in the complexity of the model. For measuring the complexity of the model, I’m going to use this column n terms, which is a measure of how many terms were used to build them, how many features were used to build that model. I’m also going to reduce or reverse the mean score. I’m going to change it from an accuracy to an error. I’m just subtracting it from one. These are now … I have my mean score and my standard deviation. I have my number of terms. What’s great about this now, is that both the mean score and the standard deviation score are both in a lower is better situation. Whereas previously we wanted a high mean score and a low standard deviation score. This is we want both low. Lower is better for both of them. Let’s go ahead and plot these results.
You notice the other thing that I’ve done here is I’ve also scaled the size of the point in our plot here by the number of terms. The bigger the point, the more terms went into making this model. What’s interesting to me, I mean, right off the bat, the model with all four terms, it’s not the best model in either sense. If the x-axis here is a stand-in for bias, this is the mean score, you can see it is definitely not the lowest in that regard. It’s also, actually that’s to be expected, that it has a higher variance because it is a more complex model. In fact, this is very interesting, the model that looks the best, definitely in terms of bias, but also it’s comparable to this one in terms of variance on the y-axis, is this single feature model. Mean resting heart rate looks to be a very promising model. Imagine that.
We’ve taken this data set with four numerical features, and when we’ve designed this experiment to assess which would be the best, what we find is, using a single feature may in fact be the best model. What’s interesting though, is that this model over here, which is also doing very well in terms of variance, is using two completely different features. That’s interesting. This is also a promising model here. It doesn’t quite achieve the same either in bias or variance, but it is doing very well. It is the same single feature, this mean resting heart rate, but now we’re also using the mean BMI. That’s another model that we might want to take a look at.
In this notebook, we will run the bias variance experiment against two families of models to assess which family to use. End of the road here. Let’s get into the last notebook, the results analysis. Run our setup notebooks. Now we’re going to make use of these functions that we are loading in, in the preprocessing notebook. We’ve been seeing these all along.
Now, we’re going to use them. We now have a function called generate feature subsets, that’s just going to do it for us. Generate bootstrap sample, that’s actually baked into the subsample sets. We’re not going to use that one directly. We’re going to run generate subsample sets. We’ve got our experiment runner here and a function retrieve results. Here we go. We’re going to generate the subsample sets, and that is done. We are going to now run the experiment using decision tree classification on each feature subset. Feature subsets is already prepared. This is the same thing we ran from before. We generated those feature subsets right here. Here they are. We’re going to pass each one of those into the experiment runner, and it’s going to use a decision tree classification model. There it goes. Now, as it’s running, you’ll note that it’s actually logging these to MLflow. We can retrieve the results and display them. These are the results of all of our decision tree models being fit. But I think we had a sense that linear models were going to do particularly well here.
I actually got that sense from this correlation plot that we looked at very early on. I just wanted to bring that back here. If you recall, we had very strong correlation between at least three of the features. Resting heart rate is strongly negatively correlated. But thinking about correlation, the negative versus positive correlation, what you’re looking for is features that are strongly correlated regardless of the sign. There’s a very strong correlation between mean resting heart rate and mean VO2 max, between mean active heart rate and mean VO2 max, and then the two heart rates have a strong correlation there as well. This is going to tell me that we’ve got multicollinearity going on for sure. The fact that we had the strong decision boundaries is telling me that linear models are going to do well and we probably don’t need all the features. We’re going to do the same thing that we just did with the decision tree. We’re going to do that down here with the logistic regression. Look, I’ve made you plug some values in here. This shouldn’t be too bad. We’re just going to plug in feature subset from the for loop right here.
The model we’re looking for is logistic regression. Now, if you recall, we had some issues earlier with the logistic regression not converging. We want to set that max iteration to be 10,000. Then the other thing I’d like to do here is set the penalty to be none. There’s a little gotcha here. Make sure you use the string none and not the Python keyword None. That’s what the logistic regression model is looking for. Let’s go ahead and run that. Okay. We are back. We can refresh this over here and see that indeed all of these logistic regression models have come in. We can retrieve the results. I’m going to use that retrieve results function passing in the metrics mean score and standard deviation score and the parameters model and subsets. I’ve actually got an additional parameter here, which is the model. Remember we’re using decision tree and logistic regression. I’m going to go ahead right now and just create a column called bias, which is going to be one minus that mean score, and a column called variance, which is the square of that standard deviation.
I’m going to drop the mean score and the standard deviation score. I’m going to sort the values in the data frame by the bias column and only display the top 10. There they are. You can see that the decision tree appears to be doing very strongly. Logistic regression is not too far behind. In fact, this logistic regression has actually a lower variance than this model. The best model is a decision tree with, it looks like, two features. We’re going to do the same thing we did before, where we’re going to compute the number of terms, and also this new column here, tradeoff. Where we’re going to compute the square of the bias and add it to the variance. Let’s sort by the tradeoff. There actually, it looks like we’ve had a logistic regression model creep up into a closer competition here. Interesting because it was one of the models with a higher bias, but it had such a low variance that it became competitive. Let’s plot those models by tradeoff and number of terms. These are the top 10 models that we identified and plotted again with the circle being the number of terms.
I mean, you can see it right here, the more terms, the higher the variance and therefore the higher the tradeoff value. It looks like mean resting heart rate is a very strong feature. I’m going to guess that when we get to the end of this journey, this is going to be one of our critical features. I mean, I could even imagine a situation where you build a model with just that feature. That brings us to the end of this webinar. The folks that are supporting the webinar will be taking your questions and helping you to clarify any misunderstandings or additional questions that you have. When we return for the next webinar in this series, we will be digging in deeper with these linear models. Especially models that can be used for feature selection, like the lasso model. I’m looking forward to seeing you all then. Thanks for attending.
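The bias, variance, number-of-terms and tradeoff columns described above can be sketched on a toy results frame. Every number below is a made-up placeholder, not a result from the webinar, and the column names (`metrics.mean_score`, `metrics.std_score`, `params.model`, `params.subset`) and the string form of the subset are assumptions about how the runs were logged.

```python
import ast

import pandas as pd

# Placeholder results, standing in for the data frame of retrieved runs.
results = pd.DataFrame({
    "params.model": ["decision tree", "logistic regression", "logistic regression"],
    "params.subset": [
        "['mean_resting_heartrate']",
        "['mean_resting_heartrate', 'mean_BMI']",
        "['mean_active_heartrate']",
    ],
    "metrics.mean_score": [0.90, 0.80, 0.75],
    "metrics.std_score": [0.30, 0.05, 0.20],
})

# Remove runs with missing values and repeated runs.
results = results.dropna().drop_duplicates()

# Lower is better for both: error instead of accuracy, squared std as variance.
results["bias"] = 1 - results["metrics.mean_score"]
results["variance"] = results["metrics.std_score"] ** 2

# Complexity: how many features went into the model.
results["n_terms"] = results["params.subset"].apply(lambda s: len(ast.literal_eval(s)))

# Tradeoff: squared bias plus variance.
results["tradeoff"] = results["bias"] ** 2 + results["variance"]

results = (
    results.drop(columns=["metrics.mean_score", "metrics.std_score"])
    .sort_values("tradeoff")
    .reset_index(drop=True)
)
print(results.head(10))
```

In this toy frame the second model has the higher bias (0.2 against 0.1) but ranks first on tradeoff because its variance is so small, the same kind of reordering the session points out.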
Sean is a principal solutions architect focusing on machine learning and data science at Databricks. He is an Apache Spark committer and PMC member, and co-author Advanced Analytics with Spark. Previo...
Joseph Bradley works as a Sr. Solutions Architect at Databricks, specializing in Machine Learning, and is an Apache Spark committer and PMC member. Previously, he was a Staff Software Engineer at Data...