I work on the Data Platform Team at Facebook, focusing on scaling distributed database systems. Previously, I worked at Teradata on developing scalable, open source data processing systems. I received my degree in Applied Mathematics from Lviv National University.
The architectural tradeoffs between the map/reduce paradigm and parallel databases have been an open debate since MapReduce emerged more than a decade ago. At Facebook, we have spent the past several years independently building and scaling both Presto and Spark to Facebook-scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto's state-of-the-art low-latency evaluation with Spark's robust, fault-tolerant execution engine. To this end, we'll present Presto-on-Spark, a highly specialized Data Frame application built on Spark that combines Presto's compiler/evaluation engine with Spark/Cosco's execution engine. In this talk, we'll take a deep dive into Presto's and Spark's architectures, focusing on the key differentiators (e.g., disaggregated shuffle) required to further scale Presto. We'll then present the Presto-on-Spark project in detail, discussing its motivation, design, and current status. We believe this is only a first step toward greater confluence between the Spark and Presto communities, and a major step toward enabling a unified SQL experience across interactive and batch use cases.
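To make the division of labor concrete, here is a toy sketch of the idea (hypothetical names, not the real Presto-on-Spark API): a Presto-style "compiler" turns SQL into plan fragments, and a Spark-style "executor" runs each fragment as a stage, materializing intermediate results between stages the way a disaggregated shuffle service such as Cosco does, so a failed task can be retried without recomputing the whole query.

```python
# Toy illustration only: names and structure are assumptions for exposition,
# not the actual Presto-on-Spark implementation.
from collections import defaultdict

def plan(sql):
    """Pretend 'compiler': returns two plan fragments for a GROUP BY query."""
    return [
        ("map", lambda row: (row["k"], row["v"])),   # fragment 1: scan + project
        ("reduce", lambda vals: sum(vals)),          # fragment 2: aggregate
    ]

def execute(fragments, partitions):
    """Pretend 'executor': runs fragments stage by stage with a materialized shuffle."""
    _, map_fn = fragments[0]
    _, reduce_fn = fragments[1]

    # Stage 1: run the map fragment on each input partition and write its
    # output to "shuffle storage" (materialized, so it survives task retries).
    shuffle = defaultdict(list)
    for part in partitions:
        for row in part:
            k, v = map_fn(row)
            shuffle[k].append(v)

    # Stage 2: run the reduce fragment over the grouped shuffle data.
    return {k: reduce_fn(vs) for k, vs in shuffle.items()}

data = [
    [{"k": "a", "v": 1}, {"k": "b", "v": 2}],
    [{"k": "a", "v": 3}],
]
result = execute(plan("SELECT k, SUM(v) FROM t GROUP BY k"), data)
# result is {"a": 4, "b": 2}
```

The point of the sketch is the boundary: planning and per-row evaluation stay on the Presto side, while stage scheduling, retries, and shuffle durability are delegated to the Spark/Cosco side.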