Nicholas Chammas

Data Engineer, MassMutual

Nick has been working with Spark since version 0.9. He’s had stints at Turbine, the Recurse Center, and Databricks. In his free time, he hacks on Flintrock, his answer to spark-ec2.

SESSIONS

Building a Scalable Record Linkage System with Apache Spark, Python 3, and Machine Learning

MassMutual has hundreds of millions of customer records scattered across many systems. There is no easy way to link a given customer's information across all these systems to build a comprehensive customer profile. Building such a profile has important applications in many areas of MassMutual's business, from marketing to underwriting. To address this issue, MassMutual built Splinkr, an internal solution that links customer records across these disparate systems in a flexible and scalable way. In this talk, Nick will share his experience building Splinkr with Apache Spark, Python 3, and simple machine learning techniques. He'll cover both the good and the bad parts of working with this stack, from clean APIs and readily available libraries to nasty Spark bugs, deployment difficulties, and bad training data.

Flintrock: A Faster, Better spark-ec2

spark-ec2 is a handy little tool for spinning up Spark clusters on EC2, but it has a few frustrating problems that are difficult to solve within its current architecture. In this talk, Nick will give an overview of Flintrock, a single-purpose command-line tool for launching and interacting with Spark clusters on EC2. Flintrock is open source and aims to be the spiritual successor to spark-ec2.