How do you connect the effectiveness of your ad spend to the sales it drives? Introducing the Sales Forecasting and Advertising Attribution Solution Accelerator. Whether you’re an ad agency or an in-house marketing analytics team, this solution accelerator allows you to easily incorporate campaign data from a variety of historical and current sources — whether streaming digital or batch TV, OOH, print, and direct mail — to see how these drive sales at a local level and forecast future performance. Attribution can normally be a fairly expensive process, particularly when running attribution against constantly updating datasets. This session will demonstrate how Databricks facilitates the multi-stage Delta Lake transformation, machine learning, and visualization of campaign data to provide actionable insights on a daily basis.
Afterwards, M&E specialist SA Layla Yang will be available to answer questions about this solution or any other media, ad tech, or marketing analytics questions you may have.
Speaker: Layla Yang
– Hi everyone, welcome, and thanks for joining us today. A quick introduction of myself: my name is Layla Yang. I am a solutions architect at Databricks, based in New York, and I mostly focus on data science and machine learning. Before Databricks I was a data scientist in the advertising technology industry. I did a lot of work building machine learning models and data products, including attribution models, which we’re going to talk about today, real-time bidding prediction algorithms, and audience segmentation. Today I’m excited to share a use case of Delta Lake and machine learning on Databricks with you. The business context for today’s discussion is that in advertising, connecting the effectiveness of advertising spend with sales or conversions is very important. If you are a marketing analyst or you’re working at an agency, you’re likely familiar with attribution models, which tie advertising campaigns to your various KPIs, for example sales or conversions. However, a traditional attribution model, for example a marketing mix model, can be a fairly expensive process, especially when you need to collect data from various sources upstream to build and refresh the attribution model against a constantly updating data set. Fortunately, Databricks provides a unified data analytics platform with Delta Lake, an open source ACID transaction layer for your cloud data lake, to power large-scale data engineering and data science. In this demo we have two notebooks. We will walk through an overview of how simple it is to build an end-to-end workflow on the Databricks platform: loading the data and using Spark to do some transformation, using Delta Lake as an optimization layer to create the bronze, silver, and gold stages of Delta tables, and ultimately feeding the data to train a media attribution model and effectively interpret the results as the last step.
The data set I use here is the foot traffic data from SafeGraph, which aggregates the number of in-store visits per location. My goal is to use Databricks to read in the in-store visit data for different fast food restaurants, explore the data in PySpark and Spark SQL, and transform the data into a clean, reliable form to make it analytics-ready for downstream machine learning tasks. As an advertiser, I want to find out which online media channels are the most effective at driving in-store visits for a New York City fast food restaurant chain. So let’s get started. The first step is to load the data from blob storage. I pre-mounted an S3 bucket to the Databricks file system so that we can directly load the raw CSV file from my blob storage into Databricks. Then I create a temp view, which allows me to directly interact with those files using Spark SQL. This is the raw data table we are using. You can see that the in-store visits column here is a big array, so I will have some feature engineering and transformation to do on it later. We also have region, city, location, and so on. I feel pretty good about this raw data set; it doesn’t look that dirty. We can write it out using the Delta format to create a Delta Lake bronze table that captures all the raw data. The advantage of the Delta Lake format is that it provides ACID reliability and performance optimization on top of your Delta table. Here I create a bronze table pointing to this location so that I can share it with my teammates. Once we get the raw data in, we have some cleaning and feature engineering work to do. For example, I want to add an additional feature, MSA region, and I want to do some parsing to get the month and year into the data frame. Then I need to explode the big array visits_by_day into separate rows. This is the block of code I use to flatten the array.
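In the notebook this flattening is done in PySpark (for example with `posexplode`); as a rough stand-alone sketch of the same transformation in plain Python, assuming the SafeGraph `visits_by_day` column arrives as a JSON-like string and `date_range_start` gives the first day of the window (field names here are illustrative, not the exact schema):

```python
import json
from datetime import date, timedelta

def explode_visits(record):
    """Flatten one SafeGraph-style row: map the visits_by_day array
    into one (date, num_visits) row per day."""
    start = date.fromisoformat(record["date_range_start"])
    daily = json.loads(record["visits_by_day"])  # e.g. "[33,22,31]" -> [33, 22, 31]
    return [
        {"location": record["location_name"],
         "date": (start + timedelta(days=i)).isoformat(),
         "num_visits": v}
        for i, v in enumerate(daily)
    ]

row = {"location_name": "Subway",
       "date_range_start": "2019-01-01",
       "visits_by_day": "[33,22,31]"}
rows = explode_visits(row)  # three daily rows, one per array element
```

The Spark version does the same thing at scale: each element of the array becomes its own row, keyed by location and day, which is what makes the daily target variable usable downstream.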
After I run this code block, it returns a data frame with visits_by_day exploded, so the number of visits is mapped to each day. Those are the target variables we’re going to use. I’m pretty happy after the feature engineering, so I persist this cleaner data to Delta Lake as the silver layer, the silver table, so that everyone on the data science team can access the data directly and do exploration from there. At this point, after the silver stage of the table, I’m ready to enrich the data set with my online campaign media data. For instance, I have banner impressions, social media Facebook likes, and web landing page visits, and I can easily join these different streams of data to my original data frame. I also call the Google Trends API to pull in the Google Trends keyword index to represent the organic search element; this data is from the Google Trends site. Finally I produce a data set including the in-store number of visits and the online media data. We see all the data coming together in this data frame, and I can quickly derive insights by just plotting the number-of-visits time series using the built-in charts. Looking good here. Finally, at the end of the ETL pipeline, we write this clean, enriched data set to Delta Lake and create a gold table on top of it for my downstream machine learning task. So far we’ve finished the ETL part: we used Databricks to ingest and combine all the raw data, and we brought clean transformations and ACID reliability to the data by writing to Delta Lake for faster performance. At this point I feel pretty good about using the Delta gold table. Now it’s time to create the attribution model, moving from the upstream ETL to the brain part. In this notebook, we’re going to use the Delta gold table and look closely at foot traffic in New York City to understand how the fast food restaurant’s various advertising campaign efforts drove in-store visits.
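The enrichment step above is a set of joins on the date key: each media stream (banner impressions, Facebook likes, landing page visits, Google Trends index) is attached to the daily visits rows. In the notebook this is a Spark join; here is a minimal plain-Python sketch of the same idea, with column names like `banner_imp` and `social_media_like` invented for illustration:

```python
def join_on_date(visits, *channels):
    """Left-join media channel series onto daily visit rows by date.
    Each channel is {"name": column_name, "series": {date: value}};
    missing dates get 0, like a left join with a zero fill."""
    enriched = []
    for row in visits:
        out = dict(row)
        for ch in channels:
            out[ch["name"]] = ch["series"].get(row["date"], 0)
        enriched.append(out)
    return enriched

visits = [{"date": "2019-01-01", "num_visits": 33},
          {"date": "2019-01-02", "num_visits": 22}]
banner = {"name": "banner_imp", "series": {"2019-01-01": 500}}
social = {"name": "social_media_like",
          "series": {"2019-01-01": 12, "2019-01-02": 40}}
enriched = join_on_date(visits, banner, social)
```

The result is one wide row per day, with the target (in-store visits) alongside every media input, which is exactly the shape the gold table needs for modeling.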
The goal of this workflow is, one, to create a machine learning approach that, given a set of online media data, predicts the number of in-store visits. And two, once we have the model, to leverage the SHAP model interpreter to decompose the model’s predictions and quantify how much of our KPI was driven by each media channel. The first few steps are to load the saved conda environment for this machine learning task and, very similarly to before, read the data by creating a data frame pointing to the foot traffic gold table. Then, as a data scientist, I want to quickly check the distribution of my target variable y, the number of visits, against the features. Did we meet the normality assumption? We can quickly identify that from this chart, which I plotted including all the New York state data. It seems like there are two peaks here, indicating a multimodal distribution. This may be due to underlying differences in population between regions. When we drill down and separate New York City traffic from all the other cities, the distribution looks close to normal, I would say. New York City must be a unique region here. As a data scientist I love using pandas. Here I can quickly convert a Spark data frame to a pandas data frame and use the Plotly library to plot charts in the notebook to explore my data. These charts are from Plotly: you can zoom in and out and drill down to specific data points to explore the data, which is quite handy. Those time series look good. This one is the Google Trends index. As you can see, I can easily create all the statistical plots I need without leaving the same Databricks environment. Then I’d like to check the distribution of all the features using Q-Q plots. Those are the features I put in, and from the Q-Q plots they look quite normally distributed.
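The drill-down described above amounts to grouping the target by region and comparing summary statistics, to see whether one region (New York City here) is driving the second mode of the distribution. A small stdlib sketch of that check, with made-up visit numbers:

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize_by_region(rows):
    """Group num_visits by region and return (mean, stdev) per region,
    to spot whether one region sits at a very different level."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["region"]].append(r["num_visits"])
    return {reg: (mean(v), stdev(v)) for reg, v in groups.items()}

# Invented numbers: NYC traffic sits far above the rest of the state,
# which is what produces the two peaks in the pooled histogram.
rows = ([{"region": "NYC", "num_visits": v} for v in (80, 90, 100, 110)]
        + [{"region": "rest_of_NY", "num_visits": v} for v in (20, 25, 30, 35)])
stats = summarize_by_region(rows)
```

When the two group means are far apart relative to their spreads, pooling them yields a bimodal histogram, and splitting them out (as in the talk) recovers roughly normal per-region distributions.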
There is the probability curve here, so by now I’m confident that the data is suitable for model training. So let’s train a prediction model. For the algorithm choice I decided to use XGBoost. The data set fits in memory, so single-node training will get things done quickly. However, building an attribution model has always been a very iterative process, right? Traditionally statisticians have to run hundreds of models to find the one best model, which is not the most efficient way to do modeling. Here I can leverage the Databricks Machine Learning Runtime to its max capacity. It has Hyperopt built in: by specifying SparkTrials here, just a one-line change, the tuning job is automatically distributed to find the optimal hyperparameters for the model. After the best set of hyperparameters was found by Hyperopt, I only need to fit the model once in this cell, and then I can use the fitted model to predict New York City Subway in-store traffic. You can see the red line here is my prediction, and the blue line is the actual number of visits. It looks like the model captured the major trend but misses a bunch of spikes here and there. It definitely needs some further tweaking later, but it’s pretty decent for our quick effort here. Once we have the prediction model, one natural question is: how does the model make a prediction? How does each feature actually contribute to this black box? I don’t really know inside out what XGBoost actually does, but I want to understand how the prediction was made by the trained model. In our case, the question becomes: how much does each media input contribute to in-store foot traffic? That’s the business context here again.
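The tuning loop in the notebook uses Hyperopt’s `fmin` with `SparkTrials` to distribute trials across the cluster. As a dependency-free stand-in that shows the shape of the search (not the notebook’s actual code), here is a tiny random search over an assumed XGBoost-style space with a toy objective; the parameter names and the objective are illustrative only:

```python
import random

def random_search(objective, space, max_evals=50, seed=42):
    """Minimal stand-in for Hyperopt's fmin: sample each hyperparameter
    uniformly from its (low, high) range and keep the lowest-loss trial."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

# Toy objective standing in for cross-validated model error:
# pretend the optimum is near max_depth=6, learning_rate=0.1.
space = {"max_depth": (2.0, 10.0), "learning_rate": (0.01, 0.3)}
objective = lambda p: (p["max_depth"] - 6) ** 2 + (p["learning_rate"] - 0.1) ** 2
best, loss = random_search(objective, space)
```

Hyperopt replaces the random sampler with a smarter search (TPE), and `SparkTrials` farms the evaluations out to worker nodes, which is the one-line change mentioned in the talk; the fit-once-with-best-params step afterward is the same in both versions.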
By directly using the SHAP library, an open source model interpreter natively integrated with XGBoost, we can quickly derive insights such as: what are the most important media channels driving my offline activity? From this contribution chart, it looks like social media and landing page visits are the two major driving forces. This chart is derived directly from the SHAP library by calculating the SHAP values, and it looks like social media drove 55% and the landing page 32%. SHAP can provide very granular insight into media mix contribution at the individual level. On this component chart, we can see how each media input positively, or even slightly negatively, impacts one day’s in-store visits. Finally, we can create a full contribution chart for one year of daily foot traffic time series. This chart gives us the full picture of how the in-store visits are attributed to each online media input. From here, again, we can zoom in and take a deep dive, and we can make recommendations to the business executives or your agency clients on how to optimize their media spend in future campaigns. With this, that concludes today’s demo. Thank you.
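The channel percentages quoted above come from aggregating per-prediction SHAP values into a share of total attributed impact per channel. A stdlib sketch of that aggregation, with invented SHAP values (the real ones come from `shap.TreeExplainer` on the trained XGBoost model):

```python
def contribution_share(shap_values):
    """Turn per-prediction SHAP values (one list per feature) into a
    percentage share of total attributed impact per media channel,
    using the mean-absolute-SHAP convention for feature importance."""
    totals = {ch: sum(abs(v) for v in vals) for ch, vals in shap_values.items()}
    grand = sum(totals.values())
    return {ch: round(100 * t / grand, 1) for ch, t in totals.items()}

# Invented SHAP values for three channels over four predictions.
shap_values = {
    "social_media_like": [30, 25, -20, 35],
    "landing_page_visit": [15, -20, 15, 14],
    "banner_imp": [5, 6, -7, 8],
}
shares = contribution_share(shap_values)  # percent of attributed impact
```

Negative SHAP values (a channel pulling one day’s prediction down) still count toward a channel’s overall importance here, which is why absolute values are summed; the per-day signed values are what the component chart in the talk visualizes.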
My name is Layla Yang. I am a Solutions Architect at Databricks. Before Databricks I started my career in the AdTech industry, focusing on building machine learning models and data products. I spent a few years at adtech startups designing, building, and deploying automated predictive algorithms into production for real-time bidding (RTB), plugged into major ad exchanges and SSPs. My work also included MMM (media mix modeling), DMP user segmentation, and customer recommendation engines. Currently I work with start-ups in the NYC and Boston area to scale their existing data engineering and data science efforts leveraging Apache Spark technology. I studied physics back in university and I love skiing.