Delta is a new open storage specification and engine for Apache Spark that adds consistency and reliability guarantees over traditional columnar storage. It also simplifies the lambda architecture by operating as both a streaming source and a sink, and by supporting incremental inserts and updates. But how does it perform, and what new use cases does it enable, compared with traditional storage? This talk evaluates the Delta storage engine running on the Databricks unified platform and compares it to the Parquet format and to common workarounds for achieving reliability in the cloud. The evaluation extends industry-standard benchmarks to include streaming and modern customer use cases. A price-performance comparison is also provided to help practitioners decide when to use Delta in production.
Nicolas is a researcher overseeing the performance and scalability of new Spark releases at Databricks, where he, along with the Amsterdam SQL performance team, is implementing the new benchmarking and monitoring infrastructure for the Databricks cloud platform. Previously, he led a project on upcoming architectures for Big Data processing at the Barcelona Supercomputing Center (BSC) - Microsoft Research joint center. Nicolas received his Ph.D. in Distributed Systems and Computer Architecture from UPC/BarcelonaTech, where he still contributes as part of the HPC and Data Centric Computing research groups.