Aaron Davidson - Databricks

Aaron Davidson

Software Engineer, Databricks

Aaron Davidson is an Apache Spark committer and software engineer at Databricks. His Spark contributions include standalone master fault tolerance, shuffle file consolidation, the Netty-based block transfer service, and the external shuffle service. At Databricks, he leads the Performance and Storage team, working on the Databricks File System (DBFS) and automating the cloud infrastructure.



Accelerating the Machine Learning Lifecycle with MLflow 1.0 - Summit 2019

Last year, Databricks launched MLflow, an open source framework to manage the machine learning lifecycle that works with any ML library to simplify ML engineering. MLflow provides tools for experiment tracking, reproducible runs and model management that make machine learning applications easier to develop and deploy. In the past year, the MLflow community has grown quickly: 80 contributors from over 40 companies have contributed code to the project, and over 200 companies are using MLflow. In this talk, we’ll present our development plans for MLflow 1.0, the next release of MLflow, which will stabilize the MLflow APIs and introduce multiple new features to simplify the ML lifecycle. We’ll also discuss additional MLflow components that Databricks and other companies are working on for the rest of 2019, such as improved tools for model management, multi-step pipelines and online monitoring.

A Deeper Understanding of Spark Internals - Summit 2014

This talk will present a technical “deep-dive” into Spark that focuses on its internal architecture. The content will be geared towards those already familiar with the basic Spark API who want to gain a deeper understanding of how it works and become advanced users or Spark developers. This talk will walk through the major internal components of Spark: the RDD data model, the scheduling subsystem, and Spark’s internal block-store service. For each component we’ll describe its architecture and role in job execution. We’ll also provide examples of how higher-level libraries like Spark SQL and MLlib interact with the core Spark API. Throughout the talk we’ll cover advanced topics like data serialization, RDD partitioning, and user-defined RDDs, with a focus on actionable advice that users can apply to their own workloads.