Nicolas Poggi - Databricks

Nicolas Poggi

Researcher, Databricks

Nicolas is a researcher overseeing the performance and scalability of new Spark releases at Databricks, where he and the Amsterdam SQL performance team are implementing the new benchmarking and monitoring infrastructure for the Databricks cloud platform. Previously, he led a project on upcoming architectures for Big Data processing at the Barcelona Supercomputing Center (BSC) – Microsoft Research joint center. Nicolas received his Ph.D. in Distributed Systems and Computer Architecture from UPC/BarcelonaTech, where he still contributes as part of the HPC and Data Centric Computing research groups.

UPCOMING SESSIONS

A Delta Tables Performance Evaluation in Modern Workloads - Summit Europe 2019

Delta is a new open storage specification and engine for Apache Spark that adds consistency and reliability guarantees over traditional columnar storage. It also simplifies the lambda architecture by operating as both a streaming source and a sink, and by supporting incremental inserts and updates. But how do its performance and use cases compare with traditional storage? This talk evaluates the Delta storage engine running on the Databricks unified platform and compares it to the Parquet format and to common workarounds for achieving reliability in the cloud. The evaluation extends industry-standard benchmarks to include streaming and modern customer use cases. A price-performance comparison is also provided to help practitioners decide when to use Delta in production.

PAST SESSIONS

Correctness and Performance of Apache Spark SQL - Summit Europe 2018

In this talk, we present a comprehensive framework we developed at Databricks for assessing the correctness, stability, and performance of our Spark SQL engine. Apache Spark is one of the most actively developed open source projects, with more than 1200 contributors from all over the world. At this scale and pace of development, mistakes are bound to happen. We will discuss the approaches we take, including random query generation, random data generation, random fault injection, and longevity stress tests. We will demonstrate the effectiveness of the framework by highlighting several correctness issues we found through random query generation, and critical performance regressions we were able to diagnose within hours thanks to our automated benchmarking tools. Session hashtag: #SAISDev10