As a startup, we’ve had the luxury of developing our batch processing infrastructure from scratch allowing us to incorporate some unconventional combinations of technologies. I will outline our current infrastructure, Spark running over mesos with data stored exclusively on S3 as a mixture of raw data in Hadoop sequence files and Parquet files, and explain the advantages it offers us over a more typical setup with Spark running on top of YARN backed by HDFS. However, running spark in this way has not been without challenges and a few set backs. I will highlight a few of the larger problems we encountered and what we did to solve them. Despite the challenges, choosing Spark has opened up many possibilities for us. To highlight the performance and flexibility we gained by using Spark, I will dive into one process, Retention, that we originally implemented using Cascalog (Datalog translated into Cascading/Hadoop) and then rewrote as a Spark job.
I researched epi-genetics as a post-doc at the Weizmann Institute in Israel after earning a PhD in Biophysics from University of California San Francisco. I left the world of academia to crack Big Data problems. That's why I joined the AppsFlyer Dev team where I work with our Big Data using Spark through Clojure.