Kim Hammar is a software engineer at Logical Clocks AB, and the main developer of Hopsworks’ Feature Store – the world’s first open-source Feature Store. He received his MSc in Distributed Systems from KTH in 2018. He has previously worked as an engineer at Ericsson, as a researcher at KTH Royal Institute of Technology, and as a data scientist at Allstate.
Hopsworks is an open-source data platform for developing and operating horizontally scalable machine learning pipelines. A key part of our pipelines is the world's first open-source Feature Store, based on Apache Hive, which acts as a data warehouse for features. It provides a natural API between data engineers, who write feature engineering code in Spark (in Scala or Python), and data scientists, who select features from the Feature Store to generate training/test data for models. In this talk, we will discuss how Databricks Delta solves several of the key challenges both in building the feature engineering pipelines that feed our Feature Store and in managing the feature data itself.
First, we will show how expectations and schema enforcement in Databricks Delta can be used to provide data validation, ensuring that feature data does not contain missing or invalid values that could negatively affect model training. Second, we will show how time travel in Databricks Delta can be used to provide version management and experiment reproducibility for training/test datasets. That is, given a model, you can re-run its training experiment using the same version of the data that was originally used to train it.
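To make the two ideas concrete, here is a minimal, library-free Python sketch. It is not Delta's or Hopsworks' API: the class `VersionedFeatureTable`, its methods, and the example feature names are hypothetical and exist only to illustrate the concepts. In Delta itself, a specific table version is typically read through Spark with the `versionAsOf` reader option. The sketch assumes an append-only table in which every successful write creates a new version and every row is checked against a fixed schema before the write is committed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


@dataclass
class VersionedFeatureTable:
    """Toy append-only feature table (hypothetical, for illustration only)."""

    schema: Dict[str, type]
    # Version 0 is the empty table; each successful append adds a snapshot.
    _versions: List[List[Row]] = field(init=False, default_factory=lambda: [[]])

    def _validate(self, row: Row) -> None:
        # Schema enforcement: the columns must match the schema exactly.
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match schema")
        # Expectation: no missing values, and each value has the declared type.
        for name, expected in self.schema.items():
            value = row[name]
            if value is None or not isinstance(value, expected):
                raise ValueError(f"invalid value for {name!r}: {value!r}")

    def append(self, rows: List[Row]) -> int:
        # Validate every row first, so a bad batch never creates a version.
        for row in rows:
            self._validate(row)
        snapshot = self._versions[-1] + [dict(r) for r in rows]
        self._versions.append(snapshot)
        return len(self._versions) - 1  # version number of the new snapshot

    def read(self, version: Optional[int] = None) -> List[Row]:
        # "Time travel": read any earlier snapshot by its version number.
        if version is None:
            version = len(self._versions) - 1
        if not 0 <= version < len(self._versions):
            raise IndexError(f"unknown version {version}")
        return [dict(r) for r in self._versions[version]]


table = VersionedFeatureTable(schema={"customer_id": int, "avg_spend": float})
v1 = table.append([{"customer_id": 1, "avg_spend": 10.5}])
v2 = table.append([{"customer_id": 2, "avg_spend": 7.0}])

# A batch with a missing value is rejected and leaves the table unchanged.
try:
    table.append([{"customer_id": 3, "avg_spend": None}])
    rejected = False
except ValueError:
    rejected = True

# Re-running an experiment reads the exact version the model was trained on.
training_rows = table.read(version=v1)
latest_rows = table.read()
```

Recording the version number (`v1` here) alongside a trained model is what makes the experiment reproducible: later writes change what `read()` returns by default, but not what `read(version=v1)` returns.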
We will also discuss the next steps for this work. Finally, we will give a live demo showing how Delta can be used in end-to-end ML pipelines using Spark on Hopsworks.