Today, we announced the launch of the Databricks Feature Store, the first of its kind that has been co-designed with Delta Lake and MLflow to accelerate ML deployments. It inherits all of the benefits from Delta Lake, most importantly: data stored in an open format, built-in versioning and automated lineage tracking to facilitate feature discovery. By packaging up feature information with the MLflow model format, it provides lineage information from features to models, which facilitates end-to-end governance and model retraining when data changes. At model deployment, the models look up features from the Feature Store directly, significantly simplifying the process of deploying new models and features.
Raw data (transaction logs, click history, images, text, etc.) cannot be used directly for machine learning (ML). Data engineers, data scientists and ML engineers spend an inordinately large amount of time transforming data from its raw form into the final "features" that can be consumed by ML models. This process is also called "feature engineering" and can involve anything from aggregating data (e.g. number of purchases for a user in a given time window) to complex features that are the result of ML algorithms (e.g. word embeddings).
This interdependency of data transformations and ML algorithms poses major challenges for the development and deployment of ML models.
The Databricks Feature Store is the first of its kind that is co-designed with a data and MLOps platform. Tight integration with the popular open source frameworks Delta Lake and MLflow guarantees that data stored in the Feature Store is open, and that models trained with any ML framework can benefit from the integration of the Feature Store with the MLflow model format. As a result, the Feature Store provides several unique differentiators that help data teams accelerate their ML efforts:
The Feature Store is a central repository of all the features in an organization. It provides a searchable record of all features, their definitions and computation logic, data sources, as well as the producers and consumers of features. Using the UI, a data scientist can search for features by feature name, feature table name, or data source, and trace each feature to its producers and consumers.
Feature Store search results for the feature name 'customer_id', feature tables with 'feature_pipeline' in their name, and data sources with 'raw_data' in their name.
The Databricks Feature Store is fully integrated with other Databricks components. This native integration provides full lineage of the data consumed to compute features, of all the consumers that request features, and of the models in the Databricks Model Registry that were trained using features from the Feature Store.
Feature table 'user_features.behavior' shows lineage to its data sources and producer notebooks, as well as to consumer models, endpoints, and notebooks.
The Feature Store supports a variety of offline and online feature providers for applications to access features. Features are served in two modes. The batch layer provides features at high throughput for training ML models and for batch inference. Online providers serve the same features at low latency for online model serving. An online provider is a pluggable abstraction that supports a variety of online stores and exposes APIs to publish features from offline to online stores and to look up features.
Features are computed using batch computation or streaming pipelines on Databricks and stored as Delta tables. These features can then be published using a scheduled Databricks Job or a streaming pipeline. This ensures consistency between the features used for batch training and those used for batch or online model inference, so there is no drift between the features consumed at training time and at serving time.
Integration with the MLflow Model format ensures that feature information is packaged with the MLflow model. During model training, the Feature Store APIs will automatically package the model artifact with all the feature information and code required at runtime for looking up feature information from the Feature Registry and fetching features from suitable providers.
feature_spec.yaml containing feature store information is packaged with the MLflow model artifact.
This feature spec, packaged with the model artifact, provides the necessary information for the Feature Store scoring APIs to automatically fetch and join the features by the stored keys for scoring input data. At model deployment, this means the client that calls the model does not need to interact with the Feature Store. As a result, features can be updated without making changes to the client.
The FeatureStoreClient Python library provides APIs to interact with the Feature Store components: defining feature computation and storage, using existing features in model training, automatic lookup for batch scoring, and publishing features to online stores. The library comes packaged with the Databricks Runtime for Machine Learning (version 8.3 or later).
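As a minimal sketch (assuming a Databricks notebook running the ML runtime), the client is instantiated directly; the `fs` handle below is reused in the examples that follow:

```python
from databricks.feature_store import FeatureStoreClient

# Create a client handle; the examples below reuse this `fs` object.
fs = FeatureStoreClient()
```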
The low-level APIs provide a convenient mechanism for writing custom feature computation code. Data scientists write a Python function that computes features from source tables or files, as in the sketch below.
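As an illustrative sketch (the source table `recommender_system.purchases` and all column names are hypothetical), such a function might aggregate raw purchase records into per-customer features with PySpark:

```python
from pyspark.sql import functions as F

def compute_customer_features(purchases_df):
    """Aggregate raw purchase records into per-customer features.

    `purchases_df` is assumed to contain `customer_id`, `quantity`,
    and `purchase_ts` columns; all names here are illustrative.
    """
    return (
        purchases_df
        .where(F.col("purchase_ts") >= F.date_sub(F.current_date(), 30))
        .groupBy("customer_id")
        .agg(
            F.count("*").alias("total_purchases_30d"),
            F.sum("quantity").alias("total_quantity_30d"),
        )
    )

# Apply to a source Delta table (hypothetical name); `spark` is the notebook's SparkSession.
customer_features_df = compute_customer_features(spark.table("recommender_system.purchases"))
```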
To create and register new features in the Feature Registry, call the create_feature_table API. You can specify the database and table name to use as the destination for your features. Each feature table needs a primary key that uniquely identifies the entity's feature values.
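A sketch of registering the computed features as a feature table; the database and table name, key, and description are assumptions, and exact keyword arguments may vary by client release:

```python
fs.create_feature_table(
    name="recommender_system.customer_features",  # destination database.table (hypothetical)
    keys="customer_id",                           # primary key identifying each customer
    features_df=customer_features_df,             # DataFrame produced by the function above
    description="Customer purchase behavior features aggregated over the last 30 days.",
)
```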
To use features from the Feature Store, create a training set that identifies the features required from each feature table and the keys in your training dataset that will be used to look up or join with the feature tables.
The example below uses two features ('total_purchases_30d' and 'page_visits_7d') from the customer_features table, joining the customer_id column from the training dataframe against the feature table's primary key. It also uses 'quantity_sold' from the product_features table, with 'product_id' and 'country_code' as the composite primary key for the lookup. It then uses the create_training_set API to build the training set, excluding keys that are not required for model training.
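A sketch of that flow; the table names, the `training_df` dataframe, the 'rating' label column, and the exact FeatureLookup argument names are assumptions:

```python
from databricks.feature_store import FeatureLookup

feature_lookups = [
    FeatureLookup(
        table_name="recommender_system.customer_features",
        feature_names=["total_purchases_30d", "page_visits_7d"],
        lookup_key="customer_id",
    ),
    FeatureLookup(
        table_name="recommender_system.product_features",
        feature_names=["quantity_sold"],
        lookup_key=["product_id", "country_code"],  # composite primary key
    ),
]

# `training_df` holds the raw training data, including the lookup keys and the label.
training_set = fs.create_training_set(
    df=training_df,
    feature_lookups=feature_lookups,
    label="rating",                                 # hypothetical label column
    exclude_columns=["customer_id", "product_id"],  # keys not needed as model inputs
)
```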
You can use any ML framework to train the model. The log_model API will package feature lookup information along with the MLmodel. This information is used for feature lookup during model inference.
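For example, a sketch using the scikit-learn flavor (the model type, label column, and registered model name are illustrative):

```python
import mlflow
from sklearn.ensemble import RandomForestRegressor

# Materialize the training set as a pandas DataFrame and split out the label.
data = training_set.load_df().toPandas()
X, y = data.drop("rating", axis=1), data["rating"]

with mlflow.start_run():
    model = RandomForestRegressor().fit(X, y)

    # Log the model together with its feature lookup information.
    fs.log_model(
        model,
        artifact_path="recommendation_model",
        flavor=mlflow.sklearn,
        training_set=training_set,                  # carries the feature spec
        registered_model_name="recommendation_model",
    )
```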
When scoring a model packaged with feature information, the features are looked up automatically; the user only needs to provide the primary key lookup columns as input. Under the hood, the Feature Store's score_batch API uses the feature spec stored in the model artifact to consult the Feature Registry for the specific tables, feature columns and join keys, then performs efficient joins with the appropriate feature tables to produce a dataframe of the desired schema for scoring the model. The following code depicts this operation with the model trained above.
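A minimal sketch, assuming a batch input dataframe that carries only the lookup keys and the model registered above:

```python
# `batch_df` needs to contain only the lookup keys (customer_id, product_id, country_code);
# the features themselves are joined automatically from the Feature Store.
batch_df = spark.table("recommender_system.customer_product_pairs")  # hypothetical input

predictions = fs.score_batch(
    "models:/recommendation_model/1",  # URI of the model registered above
    batch_df,
)
```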
To publish a feature table to an online store, first create an online store spec and pass it to the publish_table API. The following example overwrites the existing online table with the most recent features from the customer_features table in the batch provider.
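A sketch using an Amazon RDS (MySQL) online store spec; the hostname, port, and secret prefixes are placeholders, and constructor arguments may vary by release:

```python
from databricks.feature_store.online_store_spec import AmazonRdsMySqlSpec

online_store = AmazonRdsMySqlSpec(
    hostname="<rds-hostname>",                  # placeholder
    port=3306,
    read_secret_prefix="feature-store/reader",  # hypothetical secret scope/prefix
    write_secret_prefix="feature-store/writer",
)

# Overwrite the online table with the latest features from the offline Delta table.
fs.publish_table(
    name="recommender_system.customer_features",
    online_store=online_store,
    mode="overwrite",
)
```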
publish_table supports various options to filter which feature values are published (by date or any other filter condition). In the following example, today's feature values are streamed into the online store and merged with the existing features.
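A sketch of that streaming merge, reusing the `online_store` spec from above; the `_dt` partition column and date filter are illustrative:

```python
import datetime

# Stream today's feature values into the online store and merge them with existing rows.
fs.publish_table(
    name="recommender_system.customer_features",
    online_store=online_store,
    filter_condition=f"_dt = '{datetime.date.today()}'",  # illustrative partition filter
    mode="merge",
    streaming=True,
)
```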
Ready to get started or try it out for yourself? You can read more about Databricks Feature Store and how to use it in our documentation at AWS, Azure and GCP.