Edmunds.com is a car-shopping website that serves nearly 18 million visitors each month, and we rely heavily on data analysis to optimize the experience for each visitor. To accomplish that goal, the engineering team at Edmunds processes terabytes of data, and our business analysts use rich visualizations of traffic, revenue, and car-lead metrics to gain insight into the car shopper's journey. When our team was faced with the challenge of speeding up the pipeline and enabling business analysts to create, aggregate, and visualize datasets entirely on their own, we decided to use Apache Spark.
This talk is about that migration process and the bumps along the road. First, we address the technical hurdles we had to clear in bringing up Spark, including exposing our data in S3 for production ETL and ad hoc analysis using Spark SQL in combination with libraries that we built in Scala. Then, we cover the benefits we were able to achieve: better data refresh intervals, faster query times, and even increased productivity in our development process. Lastly, we cover the rich set of visualization and analysis tools we employ to make all these data marts easily accessible to our business analysts.