
Using Apache Spark to Predict Installer Retention from Messy Clickstream Data


Clickstream data is messy. A single user session in a Zynga game can generate thousands of events, with each game, client version, and OS having its own event schema. Unfortunately, most ML models require their training data to be formatted as a uniform matrix, with every user having exactly the same columns. Developing feature sets that capture all the nuanced trends and interactions in event streams is a time-consuming challenge.

At Zynga we’ve developed a technique to represent user game actions as temporal heatmap feature sets. Using the power of PySpark, our generic data pipeline can generate thousands of features without manually interpreting the events of each game. The graphical structure of the heatmaps allows us to take advantage of established image classification techniques to make personalized, user-level predictions. Within 30 minutes of a new install, Zynga can accurately predict whether the installer will churn or become a payer.
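The talk does not include the pipeline code in this abstract, but a minimal PySpark sketch of the temporal heatmap idea might look like the following. The column names (user_id, event_type, seconds_since_install), the toy events, and the 30-minute window with one-minute buckets are illustrative assumptions, not Zynga's actual schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("temporal_heatmap_sketch").getOrCreate()

# Hypothetical raw event log: one row per clickstream event.
events = spark.createDataFrame(
    [("u1", "session_start", 12), ("u1", "level_complete", 95),
     ("u1", "purchase_view", 610), ("u2", "session_start", 5)],
    ["user_id", "event_type", "seconds_since_install"],
)

window_minutes = 30  # prediction window after install (assumption)

# Bucket each event into the minute in which it occurred after install,
# then count events per (user, event type, minute) cell.
cells = (
    events
    .withColumn("minute", F.floor(F.col("seconds_since_install") / 60))
    .where(F.col("minute") < window_minutes)
    .groupBy("user_id", "event_type", "minute")
    .count()
)

# Pivot the minute axis into columns: each DataFrame row becomes one row of a
# user's heatmap (one event type across 30 one-minute buckets), zero-filled so
# every user ends up with an identical grid regardless of which events fired.
heatmap_rows = (
    cells
    .groupBy("user_id", "event_type")
    .pivot("minute", list(range(window_minutes)))
    .sum("count")
    .na.fill(0)
)

heatmap_rows.show()
```

Stacking a user's rows yields an event-type-by-minute grid that, as the abstract describes, can be fed to image-style classifiers; the specific model architecture is not stated here.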

Session hashtag: #DSSAIS15



About Patrick Halina

Patrick Halina is the tech lead of ML Engineering at Zynga, based out of the Toronto office. He works on ML and analytics infrastructure. He received his bachelor’s in Computer Engineering and master’s in Statistics from the University of Toronto. Prior to Zynga, he worked at Amazon, where he developed the marketing ML platform.