We built a domain-specific query engine that takes an arbitrary number of statistical queries and compiles them into a single fixed-stage Spark job that makes one pass over our transactional data. This talk focuses on the technical and algorithmic difficulties of building a robust, production-ready Big Data application with Spark. Algorithmic: how we achieved simultaneously lazy, memoized and distributed computation using functional programming and Scala. Technical: common gotchas and solutions around tuning, serialization, GC, compression and debugging.
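The single-pass fusion of many queries can be sketched with a monoid-style aggregator, as is common in functional Spark code. This is a minimal illustration with hypothetical names (`Agg`, `zip`), not the engine's actual API; on a cluster the same `zero`/`step`/`merge` triple would drive `rdd.aggregate` instead of a local fold.

```scala
// Hypothetical sketch: fuse several statistical queries into one traversal
// by pairing aggregators. Each Agg carries an identity value, a per-record
// step, and an associative merge so partitions can be combined.
case class Agg[A, B](zero: B, step: (B, A) => B, merge: (B, B) => B)

object Agg {
  // Combine two aggregators so a single pass answers both queries at once.
  def zip[A, B, C](x: Agg[A, B], y: Agg[A, C]): Agg[A, (B, C)] =
    Agg((x.zero, y.zero),
        { case ((b, c), a) => (x.step(b, a), y.step(c, a)) },
        { case ((b1, c1), (b2, c2)) => (x.merge(b1, b2), y.merge(c1, c2)) })
}

val count = Agg[Double, Long](0L, (n, _) => n + 1, _ + _)
val sum   = Agg[Double, Double](0.0, _ + _, _ + _)

val both = Agg.zip(count, sum)                   // still one pass over the data
val data = List(1.0, 2.0, 3.0)
val (n, s) = data.foldLeft(both.zero)(both.step) // count and sum together
```

Because `merge` is associative, the zipped aggregator distributes across partitions exactly like its parts, which is what lets an arbitrary list of queries collapse into one fixed-stage job.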
Scala, Distributed Computing, Hadoop, Big Data, Spark, Data Mining, Networking, Stochastic Mathematical Modelling