Fabio Buso is the head of engineering at Logical Clocks AB, where he leads the Feature Store development. Fabio holds a master’s degree in cloud computing and services with a focus on data-intensive applications, awarded by a joint program between KTH Stockholm and TU Berlin.
November 17, 2020 04:00 PM PT
Feature Stores for machine learning (ML) are a new class of data platform for the organization, governance, and sharing of features within enterprises. A typical feature store is a dual database architecture, where pre-computed features for training are stored in a scalable SQL platform (Delta Lake, Apache Hudi, Apache Hive), while features served to online applications are stored in a low-latency database or key-value store (MySQL Cluster (NDB), Cassandra, or Redis). Feature Stores, however, do not provide a solution for real-time features (such as user-entered data or machine-generated data) that cannot be pre-computed or cached. If the feature engineering code that transforms the raw data into features is embedded in applications, it may need to be duplicated outside the application in pipelines for generating training data.
In this talk, we introduce Hof (Hopsworks real-time feature engineering), which transforms raw data into features at low latency and at scale using Apache Spark Streaming, Pandas UDFs, and PyArrow. Applications use Hof by sending raw data to an HTTP or gRPC endpoint and receiving the engineered features in return, before sending the full feature vector to the model for prediction. Hof enables the real-time feature engineering pipeline to be reused across both real-time and offline use cases (when creating training data for the same features). Hof can also enrich real-time features and build complete feature vectors by joining real-time features with features from the online feature store. We will show how the core feature store principles can be extended to real-time feature engineering: code tracking, feature pipeline reuse, ensuring the consistency of features between training and serving, and automated metadata and statistics for features. Finally, we will show how the Hof architecture enables real-time features to be debugged, audited, and saved for reuse in training models.
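The reuse idea in this abstract can be illustrated with a minimal sketch in plain Python. It does not use the Hof or Hopsworks API: the function names, the feature names, and the in-memory `ONLINE_STORE` dictionary are all hypothetical stand-ins. The point is only that one transformation function feeds both the serving path and the training-data path, so the two cannot drift apart.

```python
from math import log1p
from typing import Dict, List

# Hypothetical stand-in for an online feature store:
# pre-computed features keyed by user id.
ONLINE_STORE: Dict[str, Dict[str, float]] = {
    "u1": {"avg_spend_30d": 42.0},
    "u2": {"avg_spend_30d": 7.5},
}


def engineer_features(raw: Dict[str, float]) -> Dict[str, float]:
    """The single transformation shared by serving and training."""
    hour = raw["hour"]
    return {
        "log_amount": log1p(raw["amount"]),
        "is_night": 1.0 if hour < 6 or hour >= 22 else 0.0,
    }


def serve(user_id: str, raw: Dict[str, float]) -> Dict[str, float]:
    """Online path: engineer real-time features, then enrich them
    with pre-computed features looked up in the online store."""
    features = engineer_features(raw)
    features.update(ONLINE_STORE.get(user_id, {}))
    return features


def build_training_rows(history: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Offline path: apply the same transformation to logged raw data."""
    return [engineer_features(raw) for raw in history]


if __name__ == "__main__":
    request = {"amount": 120.0, "hour": 23}
    print(serve("u1", request))
    print(build_training_rows([request]))
```

In a Spark-based setting the body of `engineer_features` would more plausibly be vectorized, for example as a Pandas UDF, but the structure stays the same: one definition, two callers.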
Speaker: Fabio Buso
June 23, 2020 05:00 PM PT
A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, thus making it easier to transform and validate the training data that is fed into machine learning systems. Feature stores can also enable consistent engineering of features between training and inference, but to do so, they need a common data processing platform. The first Feature Stores, developed at hyperscale AI companies such as Uber, Airbnb, and Facebook, enabled feature engineering using domain-specific languages, providing abstractions tailored to the companies' feature engineering domains. However, a general-purpose Feature Store needs a general-purpose platform for feature engineering, feature selection, and feature transformation.
In this talk, we describe how we built a general-purpose, open-source Feature Store for ML around dataframes and Apache Spark. We will demonstrate how our platform, Hopsworks, seamlessly integrates with Spark-based platforms, such as Databricks. With the Feature Store, we will demonstrate in Databricks how data engineers can transform and engineer features from backend databases and data lakes, while data scientists can use PySpark to select and transform features into train/test data in a file format of choice (.tfrecords, .npy, .petastorm, etc.) on a file system of choice (S3, HDFS). We will also show the potential of Koalas for making feature engineering even easier on PySpark. Finally, we will show how the Feature Store enables end-to-end ML pipelines to be factored into feature engineering and data science stages that can each run at different cadences.
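The split between a feature engineering stage and a data science stage can be sketched as follows. This is an illustrative plain-Python sketch, not Hopsworks or PySpark code: the feature groups are modeled as dictionaries keyed by a hypothetical entity id, and the helper names `join_on_key` and `select_and_split` are assumptions made for this example.

```python
import random
from typing import Dict, List, Tuple

Row = Dict[str, float]


def join_on_key(left: Dict[str, Row], right: Dict[str, Row]) -> Dict[str, Row]:
    """Inner-join two feature groups that share the same entity key."""
    return {k: {**left[k], **right[k]} for k in left.keys() & right.keys()}


def select_and_split(
    joined: Dict[str, Row],
    columns: List[str],
    test_fraction: float = 0.25,
    seed: int = 0,
) -> Tuple[List[Row], List[Row]]:
    """Keep only the selected features and split rows into train and test."""
    keys = sorted(joined)  # fixed order first, so the shuffle is reproducible
    random.Random(seed).shuffle(keys)
    rows = [{c: joined[k][c] for c in columns} for k in keys]
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


if __name__ == "__main__":
    # Feature groups as the feature engineering stage might have written them.
    spend = {f"u{i}": {"avg_spend": 10.0 * i, "n_orders": float(i)} for i in range(8)}
    profile = {f"u{i}": {"account_age_days": 100.0 + i} for i in range(8)}

    # Data science stage: join, select, split.
    train, test = select_and_split(
        join_on_key(spend, profile), ["avg_spend", "account_age_days"]
    )
    print(len(train), len(test))
```

Because the second stage reads only from stored feature groups, it can be rerun with a different feature selection or split without recomputing the features themselves, which is what allows the two stages to run at different cadences.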