Gender Prediction with Databricks AutoML Pipeline

May 28, 2021 11:05 AM (PT)


As the nation’s leading advocate for people aged 50+, each month AARP conducts thousands of campaigns made up of hundreds of millions of emails, mailings, and phone calls to over 37 million members and a broader universe of non-members. Missing demographic information results in less accurate profiling and targeting strategies. For example, there are 1.5 million active members and 15 million expired members missing gender information in AARP’s database. The name-gender model is a use case where the AARP Data Analytics team utilized the Databricks Lakehouse platform to create a fully automated machine learning model. The Random Forest Classifier used 800,000 existing distinct first names, ages, and over 700 variables derived from the letter composition of first names to predict gender. It leveraged MLflow to track the accuracy of models and log the metrics over time, and registered multiple model versions to pick the best for production. Model training and scoring were scheduled, and the AutoML pipeline significantly minimized manual working hours after the initial setup. As a result, AARP dramatically (and accurately) improved the coverage of gender information from about 92.5% to 99.5%.

In this session watch:
Sharon Xu, Senior Data Science Manager, AARP
Qing Sun, Resident Solutions Architect, Databricks


Transcript

Sharon Xu: Hello, everyone. Welcome to my session. My name is Sharon Xu. I’m the Senior Advisor for the Advanced Analytics Team at AARP. Here is my co-speaker, Qing Sun; she’s a Resident Solutions Architect from Databricks. Today, I’m going to talk about the gender prediction model, which AARP built about two years ago, and how we used a Databricks AutoML pipeline and some of its functions to make the whole process automated. So let’s start. First of all, a pretty short introduction for both of us. I work at AARP on the Advanced Analytics Team. My major role on this team is to help my internal clients use machine learning and predictive models to target their marketing universe with maximum efficiency, lowest cost, and the highest responders. Other than modeling, I also do incremental analysis and attribution analysis to understand each program’s actual value to the organization.
Beyond that, all the model-driven, data-driven analytics work comes to our team, and to me. I graduated from the University of Maryland with a PhD in Civil Engineering; my major was transportation demand forecasting. I know that sounds pretty different from what I’m doing now, but I did benefit a lot from the model building experience during my PhD study. Starting from spring this year, I have been quite busy with gardening, but not very successfully, so my ultimate goal this summer is just to keep at it and have some produce. That’s it. I will hand over to Qing to introduce herself.

Qing Sun: Hello, everyone. Welcome to Data & AI Summit 2021. My name is Qing. I’m a Resident Solutions Architect who worked with Sharon on this project. I’m also a data scientist and machine learning expert in the public and federal space, so if you are on a team at a company that is in the process of adopting the Data Lakehouse and AI, you can reach out to me for help. Today, Sharon and I are going to take you through our exciting machine learning journey at AARP, and we hope our experience will be helpful to you. Let’s get started.

Sharon Xu: All right. Thank you, Qing. Before we kick off, I would like to thank my 2018 summer intern, [Natishe Senhari], who spent a lot of time on research to find the best methodology for this gender prediction model. He also completed the first draft of the model, which had pretty good performance, so our following work to use AutoML is based on his work. I would also like to thank my manager, Jerry [Weiss], who proposed this interesting idea to build the gender prediction model. Today we are going to walk through the background of this model, our detailed solution for the gender prediction model, and the AutoML pipeline we applied to this model, and of course the learnings and the future.
First of all, a bit of intro about AARP. AARP has 38 million members, so maybe some of you, or your parents or grandparents, are members. We have been championing positive social change, delivering value to our 50-plus members, and encouraging people to choose how they live as they age. You can see a bunch of pictures here that my coworker helped me select; they cover a variety of the fun activities AARP has brought to our members. Talking about models, AARP has over 150 predictive models for different uses. The major model scope includes four pillars. The first one is about membership: we use models to acquire and recruit members, and also to identify which members are most likely to renew or not to renew.
Those acquisition models work through different channels, like direct mail, alt media, or lead gen, and we also have renewal models. The second pillar of models identifies the people who are most likely to respond through different channels: the phone channel, the email channel, the online channel. The third pillar is the biggest; it covers a lot of AARP activities and campaigns where we target our members by using models, including advocacy and the Foundation campaigns. We also have fun events like movies, festivals, and dancing parties, and programs like fraud prevention or driver safety, plus a lot of online programs as well. The last pillar is about demographics: models used to identify people’s demographics, and also for imputation. In our database there is some information missing, and we build models to impute those missing values, such as the gender prediction model I’m going to talk about. So this is the big scope of AARP’s predictive models.
So why do we need a gender model? In our AARP targeting audience universe, there are about 2 million records missing gender information. This missing gender information leads to less accurate profiling and less accurate targeting strategies. That caused some issues, so we needed to build a model to impute those missing values. Other than that, this was also a pretty straightforward use case for us to test a few functions on Databricks, like MLflow, the Model Registry, and the entire automated pipeline, just to see how they work for our prediction models. Simply speaking, the gender prediction model uses a Random Forest Classifier to identify gender from people’s first names and ages. From the names alone, a lot of variables are derived, based on the letter composition and the alphabetic order of the letters in people’s names.
Here’s a detailed example. From the single name feature, we can derive about 731 different features. How does this happen? The first portion is the frequency of letters; I’ll use my name and Qing’s name as examples. Sharon has one “a,” one “h,” and one “n,” so we count the frequency of each letter in people’s names. The second part is the summed-up positions of the letters: we look at the position of each letter. For example, Sharon’s “a” is at position three of the name, “h” is at position two, and “n” is at position six. If the same letter appears multiple times in a name, we sum all of those letter positions together.
The third part, the biggest block, is the bigrams: 26 by 26 variables. We detect every single bigram in people’s names. For Sharon, we detect S-H, H-A, A-R, R-O, and O-N, so these bigram combinations are all flagged as one. The last part is the alphabetic positions of the last and second-to-last characters, based on the name’s last two letters. For example, Sharon’s “o” is at position 15 in the alphabetic order, and “n” is at position 14. And we have the length of the name: Sharon has six letters. That’s how over 700 features are derived in total.
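To make the feature derivation concrete, here is a minimal sketch in plain Python. The function and feature names are hypothetical, and it assumes the names have already been cleaned down to lowercase letters; note that 26 + 26 + 676 + 2 + 1 = 731, which matches the feature count above.

```python
import string

ALPHABET = string.ascii_lowercase  # 'a'..'z'

def name_features(name: str) -> dict:
    """Derive the 731 name features described above, assuming
    `name` is already cleaned to lowercase letters only."""
    feats = {}

    # 1) Frequency of each letter, e.g. "sharon" -> freq_a = 1
    for c in ALPHABET:
        feats[f"freq_{c}"] = name.count(c)

    # 2) Summed 1-indexed positions of each letter within the name,
    #    e.g. "sharon" -> pos_a = 3, pos_h = 2, pos_n = 6; repeated
    #    letters have their positions added together
    for c in ALPHABET:
        feats[f"pos_{c}"] = sum(i + 1 for i, ch in enumerate(name) if ch == c)

    # 3) Bigram indicator flags, 26 x 26 = 676 variables,
    #    e.g. "sharon" -> sh, ha, ar, ro, on are all flagged as 1
    for c1 in ALPHABET:
        for c2 in ALPHABET:
            feats[f"bg_{c1}{c2}"] = 0
    for a, b in zip(name, name[1:]):
        feats[f"bg_{a}{b}"] = 1

    # 4) Alphabet positions of the second-to-last and last characters,
    #    e.g. "sharon" -> 'o' = 15, 'n' = 14
    feats["second_last_alpha"] = ALPHABET.index(name[-2]) + 1 if len(name) > 1 else 0
    feats["last_alpha"] = ALPHABET.index(name[-1]) + 1

    # 5) Length of the name, e.g. "sharon" -> 6
    feats["name_len"] = len(name)

    return feats  # 26 + 26 + 676 + 2 + 1 = 731 features
```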
It wasn’t just a Random Forest model; we tried a variety of different models, and Random Forest was the final learner. It achieves about 76% accuracy for gender prediction on purely new distinct names. That means if a name does not exist in our model training data set, we get about 76% accuracy. On a normal data set, which includes some names existing in the training data set and some new names, the model gives us about 90% accuracy. The model so far can only be used to predict binary gender, because in our database we don’t have enough non-binary data.
Next, I’m going to walk through the AutoML pipeline used for this gender prediction model. This is a pretty standard machine learning pipeline, per se. All our data is saved in AWS, and Databricks is built on top of that. We pull the data from our database: people’s names, genders, and ages. But the data wasn’t that clean; a lot of names had spaces or random characters, so we cleaned them up, which left about 800,000 distinct name-gender combinations. With the clean data set, we used the methodology I described on the previous slide to derive over 700 features from the names and saved them in Delta format. The Delta tables definitely gave us faster running speed in the following steps.
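As a rough illustration of this data preparation step, here is a minimal PySpark sketch, assuming the data is read from and written to Delta tables; the table and column names are hypothetical.

```python
from pyspark.sql import functions as F

# Pull the raw name, gender, and age records (table name is hypothetical)
raw = spark.table("member_db.name_gender_age")

# Clean the names: trim whitespace, lowercase, strip stray non-letter
# characters, and keep the distinct name/gender combinations
clean = (
    raw
    .withColumn("first_name", F.lower(F.trim("first_name")))
    .withColumn("first_name", F.regexp_replace("first_name", "[^a-z]", ""))
    .filter(F.length("first_name") > 1)
    .dropDuplicates(["first_name", "gender"])
)

# Save the ~800,000 distinct name/gender combinations in Delta format
# so the feature engineering and training steps run faster
clean.write.format("delta").mode("overwrite").saveAsTable("analytics.name_gender_clean")
```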
Then we have a very typical model building and training process. We used grid search to look for the best parameter combination, so there were about 18 different model runs, which took a lot of time. MLflow helped track all these model run results and evaluation metrics, and the best model was registered by using the Model Registry function. Once the best model is saved and ready for deployment, we just load the model from the Model Registry and score the new data set. Once everything is developed, the model is pushed into production, and we scheduled a job to make the model retrain itself and regularly score new data sets in the future. After these scheduled jobs are set up, the whole process is completely hands-free.
Next I’m going to show you some screenshots of our tables and code, to give the audience a better idea of what the model code looks like. This slide shows the cleaned data set: after cleaning all the names and running the feature engineering step, we have all the 700-plus features derived from people’s names, ready for model building. In the model building, this is what I really like about Spark coding: you can see we use the Pipeline function to feed the feature assembler and the Random Forest model into the pipeline, wrapping them up together. Then the pipeline is fed to cross-validation, with the evaluation metric set up for the cross-validation and parameter tuning steps. So all the steps are wrapped together, which is well organized and very clear. Personally, I think it looks better than the spaghetti code I’ve seen. As I said, we had 18 different iterations, so all these parameter tuning steps were logged by MLflow to keep the accuracy results, and the best model was chosen.
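The pipeline and tuning setup Sharon describes can be sketched roughly like this in PySpark, with the feature assembler and Random Forest wrapped in one Pipeline and fed to cross-validation. The table name, label column, and grid values are hypothetical; the grid is sized at 3 x 3 x 2 = 18 candidates only to match the 18 runs mentioned.

```python
import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Load the engineered features (hypothetical table; one label column,
# the rest are the 731 name-derived features plus age)
train_df = spark.table("analytics.name_gender_features")
feature_cols = [c for c in train_df.columns if c not in ("first_name", "label")]

# Wrap the feature assembler and the Random Forest together in one Pipeline
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[assembler, rf])

# Grid search over parameter combinations (illustrative values)
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [50, 100, 200])
        .addGrid(rf.maxDepth, [5, 10, 15])
        .addGrid(rf.featureSubsetStrategy, ["sqrt", "log2"])
        .build())

evaluator = BinaryClassificationEvaluator(labelCol="label")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3, parallelism=4)

# Track the tuning run with MLflow and keep the best model's artifacts
with mlflow.start_run(run_name="gender_rf_grid_search") as run:
    cv_model = cv.fit(train_df)
    mlflow.log_metric("best_cv_auc", max(cv_model.avgMetrics))
    mlflow.spark.log_model(cv_model.bestModel, "model")
    best_run_id = run.info.run_id
```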
We also logged the best model’s performance and metrics, like the parameter tuning results from the best model and all the evaluation metrics. At the bottom of the screenshot, you can see the best model was saved by the Model Registry function, pushed into production, and ready to go. Once everything is set up, we scheduled a job. This is a screenshot of the scoring notebook; the scoring process is scheduled on the 10th of every month. On the 10th of every month, the model automatically scores the new data set and sends the data to our users.
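A minimal sketch of registering and promoting the best model and of the monthly scoring step might look like this; the registered model name, stage workflow, and table names are assumptions, and `best_run_id` carries over from the tuning sketch above.

```python
import mlflow
import mlflow.spark
from mlflow.tracking import MlflowClient

# Register the winning model in the Model Registry (name is hypothetical)
version = mlflow.register_model(f"runs:/{best_run_id}/model", "gender_prediction_rf")

# Promote it to Production so the scoring notebook always loads the best version
client = MlflowClient()
client.transition_model_version_stage(
    name="gender_prediction_rf", version=version.version, stage="Production")

# --- scoring notebook, scheduled for the 10th of every month ---
prod_model = mlflow.spark.load_model("models:/gender_prediction_rf/Production")
new_names = spark.table("analytics.new_name_features")  # hypothetical input table
scored = prod_model.transform(new_names)
scored.write.format("delta").mode("overwrite").saveAsTable("analytics.gender_scores")
```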
Okay. So here is the entire process I described. From the start of the gender prediction model, this is what I learned. First of all, the entire process has been much faster. Initially, our intern used plain Python to work out the parameter tuning and model training, and sometimes we had out-of-memory issues. By using Databricks and PySpark, we definitely worry less about the large data set and the different tuning iterations, because of the customized clusters and parallelism settings, and Delta also helped the entire process. MLflow easily tracked all the model results, and the Model Registry simplified the model management work, so we can better organize the models. My favorite part is the self-retraining and scoring: once you schedule the jobs, the model can retrain itself and score by itself without any human intervention.
So, yeah, I barely need to do anything at the end. What I learned here made me think about the rest of AARP’s models. We have hundreds of other predictive models; can we just apply a similar methodology to make the other models also automatically train and score? That would make things a lot easier, and I think we can. First of all, we are working on an integrated data platform, which AARP has been building for the past two years. An integrated data platform can definitely minimize flat file intake and avoid moving data sets across different platforms, which are pretty annoying manual steps. I would also encourage my modeling team to use MLflow and the Model Registry to organize the model results and store the models systematically. Last but not least, when building models, we need to think about streamlining the entire process and streamlining our coding.
Once the structure is set up, with the model training, scoring, and deployment all in place, we just schedule the job to let the retraining and scoring processes run themselves regularly. This is fully automated work, which is the ideal, the automation goal we want to reach. Once everything is automated, I guess my daily life is just to grab a cup of coffee and watch the models all take care of themselves. That’s the ideal work. But once I finished writing that last bullet, it made me think: what if all the models take care of themselves? Am I going to lose my job? So, yeah. Qing, what do you think?

Qing Sun: That’s a great question, Sharon. I would say the core values of adopting the Databricks platform, in my opinion, are the increased productivity and the ease of use, and it is made really easy for you to scale as well. When you need to retrain, say, 100 models, you can leverage the Databricks Jobs API and scripts to create automated jobs for training and scoring, and leverage the same MLflow process to track your experiments and to register and promote your models. Those processes can be perfectly combined with your existing CI/CD pipeline. So even when you have hundreds of models in production at AARP, it’s no big deal; the Databricks platform can easily support that. And we make the machine learning development life cycle really easy for everyone, so the data scientists can really focus on developing more models and bringing more value to the business. That’s a huge win. Great job, Sharon. That’s a fantastic story you shared today. Let’s wrap it up.
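As a rough sketch of the scaling pattern Qing mentions, one could script job creation against the Databricks Jobs API; the host, token, notebook paths, cluster settings, and model list below are all placeholders.

```python
import requests

host = "https://<workspace>.cloud.databricks.com"   # placeholder workspace URL
headers = {"Authorization": "Bearer <token>"}        # placeholder access token

# One scheduled retrain-and-score job per model (hypothetical model list)
for model_name in ["gender_rf", "renewal", "email_response"]:
    job_spec = {
        "name": f"{model_name}_monthly_retrain_and_score",
        "schedule": {
            # Quartz cron: 6:00 AM on the 10th of every month
            "quartz_cron_expression": "0 0 6 10 * ?",
            "timezone_id": "America/New_York",
        },
        "tasks": [{
            "task_key": "retrain_and_score",
            "notebook_task": {"notebook_path": f"/Models/{model_name}/retrain_and_score"},
            "new_cluster": {
                "spark_version": "8.3.x-cpu-ml-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
            },
        }],
    }
    resp = requests.post(f"{host}/api/2.1/jobs/create", headers=headers, json=job_spec)
    resp.raise_for_status()
    print(model_name, "->", resp.json()["job_id"])
```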

Sharon Xu: Thank you, Qing. So that’s it. Please send us all your feedback and all your questions, and don’t forget to rate and review our session. Thank you to the audience, and thank you for your time.