RJ Nowling is a Software Engineer in Emerging Technology at Red Hat, Inc., where he’s part of a data science team that consults for internal customers. RJ is a committer on Apache Bigtop, a contributor to Apache Spark, and co-lead of the BigPetStore family of big data example applications. Before joining Red Hat, RJ focused on academic research in the fields of computational physics, bioinformatics, and distributed systems. He’s currently a Ph.D. candidate at the University of Notre Dame.
Detailed customer profiles resulting from customer segmentation make sales teams more effective, enable more personalized customer service, and highlight cross-selling opportunities. Successful customer segmentation requires overcoming numerous challenges. Features, models, and summarization techniques must be both human-interpretable and accurate, while large data volumes and hyperparameter optimization present significant computational challenges. Spark's combination of high-performance distributed processing and extensive libraries makes it an attractive platform for tackling customer segmentation. By using Spark, we at Red Hat have reduced the run times of our workflows from days to minutes, enabling us to tackle computationally challenging analyses that were previously out of reach. We recently used Spark to implement a customer segmentation pipeline that operates on customer-portal clickstream data, and we used it to analyze millions of page views by hundreds of thousands of users. Thanks to nearly instant results, we were able to iterate rapidly over our workflow, algorithms, and hyperparameter choices to improve accuracy. In this talk, we'll describe our pipeline and the resulting insights into customer behavior, the advantages of using Spark, and lessons learned about data cleaning, choosing algorithms, and performance optimization.