Software engineering evolved around certain best practices such as versioning code, dependency management, feature branches, etc. However, the same best practices have not translated to data science. Data scientists who update a stage of their ML pipeline need to understand the cascading effects of their change so that their downstream dependencies do not end up with stale data, or unnecessarily rerunning the entire pipeline end-to-end. When data scientists collaborate, they should be able to use the intermediate results from their colleagues instead of computing everything from scratch.
This presentation shows how to treat data like code through the concept of Data-Driven Software (DDS). This concept, implemented as a lightweight and easy-to-use python package, solves all the issues mentioned above for single user and collaborative data pipelines, and it fully integrates with a lakehouse architecture such as Databricks. In effect, it allows data engineers and data scientists to go YOLO: you only load your data once, and you never recalculate existing pieces.
Through live demonstrations leveraging DDS, you will see how data science teams can:
Brooke Wenig is a Machine Learning Practice Lead at Databricks. She leads a team of data scientists who develop large-scale machine learning pipelines for customers, as well as teach courses on distributed machine learning best practices. She is a co-author of Learning Spark, 2nd Edition, co-instructor of the Distributed Computing with Spark SQL Coursera course, and co-host of the Data Brew podcast. She received an MS in Computer Science from UCLA with a focus on distributed machine learning. She speaks Mandarin Chinese fluently and enjoys cycling.
Tim Hunter is a senior AI specialist at the ABN AMRO Bank. He was an early software engineer at Databricks and has contributed to the Apache Spark MLlib project, and he has co-created the Koalas, GraphFrames, TensorFrames and Deep Learning Pipelines libraries. He holds a Ph.D. in Machine Learning from UC Berkeley and he has been building distributed Machine Learning systems with Spark since version 0.0.2, before Spark was an Apache Software Foundation project.