Bogdan Ghit is a computer scientist and software engineer at Databricks, where he works on optimizing the SQL performance of Apache Spark. Prior to joining Databricks, Bogdan pursued his PhD at Delft University of Technology where he worked broadly on datacenter scheduling with a focus on data analytics frameworks such as Hadoop and Spark. His thesis has led to a large number of publications in top conferences such as ACM Sigmetrics and ACM HPDC.
Imagine an organization with thousands of users who want to run data analytics workloads. These users shouldn’t have to worry about provisioning instances from a cloud provider, deploying a runtime processing engine, scaling resources based on utilization, or ensuring their data is secure. Nor should the organization’s system administrators.
In this talk we will highlight some of the exciting problems we’re working on at Databricks in order to meet the demands of organizations that are analyzing data at scale. In particular, data engineers attending this session will walk away with learning how we:
In this talk, we present a comprehensive framework we developed at Databricks for assessing the correctness, stability, and performance of our Spark SQL engine. Apache Spark is one of the most actively developed open source projects, with more than 1200 contributors from all over the world. At this scale and pace of development, mistakes bound to happen. We will discuss various approaches we take, including random query generation, random data generation, random fault injection, and longevity stress tests. We will demonstrate the effectiveness of the framework by highlighting several correctness issues we have found through random query generation and critical performance regressions we were able to diagnose within hours due to our automated benchmarking tools. Session hashtag: #SAISDev10