Sharon Xu

Senior Data Science Manager, AARP

Past sessions

Summit 2021 Gender Prediction with Databricks AutoML Pipeline

May 28, 2021 11:05 AM PT

As the nation’s leading advocate for people aged 50+, AARP conducts thousands of campaigns each month, made up of hundreds of millions of emails, mailings, and phone calls to over 37 million members and a broader universe of non-members. Missing demographic information results in less accurate profiling and targeting strategies. For example, 1.5 million active members and 15 million expired members are missing gender information in AARP’s database. The name gender model is a use case in which the AARP Data Analytics team used the Databricks Lakehouse platform to create a fully automated machine learning model. The Random Forest classifier used 800 thousand existing distinct first names, ages, and over 700 variables derived from the letter composition of first names to predict gender. The team used MLflow to track model accuracy and log metrics over time, and registered multiple model versions to pick the best one for production. Model training and scoring were scheduled, and the AutoML pipeline significantly reduced manual working hours after the initial setup. As a result, AARP dramatically (and accurately) improved the coverage of gender information from about 92.5% to 99.5%.
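
The abstract does not list the 700-plus letter-composition variables, so the following is only a minimal sketch of what such features might look like. The function name `name_features` and the specific features (per-letter counts, name length, vowel count, and first/last-letter indicators) are assumptions made for illustration, not the variables AARP actually used.

```python
import string
from collections import Counter

ALPHABET = string.ascii_lowercase


def name_features(first_name: str) -> dict[str, int]:
    """Build illustrative letter-composition features for one first name.

    Hypothetical feature set: the session abstract does not specify
    which variables were derived from the names.
    """
    # Keep only ASCII letters, lowercased, so "Mary-Ann" becomes "maryann".
    name = "".join(ch for ch in first_name.lower() if ch in ALPHABET)
    counts = Counter(name)

    # How many times each letter occurs in the name.
    features = {f"count_{ch}": counts.get(ch, 0) for ch in ALPHABET}
    features["length"] = len(name)
    features["vowel_count"] = sum(counts[v] for v in "aeiou")

    # One-hot indicators for the first and last letter.
    # Slicing (name[:1], name[-1:]) stays safe for an empty name.
    for ch in ALPHABET:
        features[f"first_{ch}"] = int(name[:1] == ch)
        features[f"last_{ch}"] = int(name[-1:] == ch)
    return features


if __name__ == "__main__":
    feats = name_features("Sharon")
    print(feats["length"], feats["first_s"], feats["last_n"])
```

Each name becomes one fixed-width row of numeric features. In a pipeline like the one described, such rows, together with age, could be passed to a random forest classifier; richer variables such as letter pairs or suffixes would be one plausible way to grow the feature count toward the hundreds.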

In this session watch:
Sharon Xu, Senior Data Science Manager, AARP