Talk to any data science organization and they will almost unanimously tell you that the biggest challenge in building high-quality AI models is accessing and managing the data. Over the years, practitioners have turned to a variety of technologies and abstractions to help accelerate experimentation and development. In the past few years, Feature Stores have become an increasingly popular way for practitioners to organize and prepare their data for machine learning. In early 2022, Databricks made its Feature Store generally available. This summer, we are excited to introduce feature engineering and management as native capabilities in Databricks Unity Catalog. This marks a major evolution in how AI data is managed: it unites feature management with a best-in-class data catalog, simplifying and securing the process of creating features and using them to train and serve models.
Feature Stores are a type of catalog designed to meet two primary requirements: they must facilitate easy discovery and use of ML data, and they must make that stable, high-quality data readily available to high-performance model training and serving systems. Feature Stores enable data scientists to easily discover the features available in their organization, add new features, and use them directly in their ML applications.
Unity Catalog provides centralized access control, sharing, auditing, lineage, and data discovery capabilities across your Lakehouse and your Databricks workspaces. As we worked with Feature Store customers, they asked time and again for Unity Catalog capabilities such as sharing and governance of their features. It became increasingly clear: "Why have two separate catalogs, one for your features and one for everything else?"
Once we started to implement the unified Features in Unity Catalog experience, it became evident just how much impact this evolution of the feature store would have on many aspects of the AI development workflow.
Feature Engineering in Unity Catalog simplifies the training and deployment of models by building feature store capabilities directly into the Unity Catalog, the catalog that manages the Lakehouse.
Organizations typically want to standardize on a single ELT framework for all data engineering pipelines in order to maintain consistency and ensure enterprise policies are applied to all datasets in the Lakehouse. Merging the feature engineering capabilities into Unity Catalog enables organizations to use the same standardized ELT framework to write and maintain feature engineering pipelines.
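As a rough illustration, here is what a feature pipeline might look like when written with the same Spark-based ELT code used elsewhere in the Lakehouse. The table and column names (main.raw.orders, main.ml.customer_features, and so on) are hypothetical, and spark is the ambient session in a Databricks notebook.

```python
# A minimal sketch of a feature engineering pipeline written with plain
# Spark, the same ELT framework used for other Lakehouse pipelines.
# All table and column names here are illustrative placeholders.
from pyspark.sql import functions as F

raw_orders = spark.table("main.raw.orders")  # hypothetical source table

customer_features = (
    raw_orders
    .groupBy("customer_id")
    .agg(
        F.count("*").alias("order_count"),
        F.avg("order_amount").alias("avg_order_amount"),
    )
)

# Writing to Unity Catalog makes the features discoverable and governed
# like any other Delta table.
customer_features.write.mode("overwrite").saveAsTable("main.ml.customer_features")
```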
To simplify the process of creating new features in Unity Catalog, we upgraded the SQL syntax to support a TIMESERIES clause as part of the PRIMARY KEY constraint. This enables applications that automatically use features for training and scoring models to perform appropriate point-in-time joins [AWS][Azure][GCP].
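For illustration, a time series feature table created with this syntax might look like the following sketch; the catalog, schema, and column names are hypothetical.

```python
# Sketch: creating a time series feature table in Unity Catalog.
# The TIMESERIES keyword on the timestamp key is what enables
# point-in-time joins. Table and column names are placeholders.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.ml.customer_features_ts (
    customer_id INT NOT NULL,
    ts TIMESTAMP NOT NULL,
    avg_order_amount DOUBLE,
    CONSTRAINT customer_features_ts_pk
      PRIMARY KEY (customer_id, ts TIMESERIES)
  )
""")
```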
Customers may have existing feature tables created using a home-grown feature store implementation, open source libraries, or vendor DSLs. By adding a PRIMARY KEY constraint to these Delta tables, they can use these features directly to train and serve ML models.
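A minimal sketch of that promotion, assuming an existing Delta table named main.ml.existing_features keyed by customer_id (both names are placeholders):

```python
# Key columns must be NOT NULL before a primary key can be added.
spark.sql("""
  ALTER TABLE main.ml.existing_features
  ALTER COLUMN customer_id SET NOT NULL
""")

# Adding the constraint makes the table usable as a feature table.
spark.sql("""
  ALTER TABLE main.ml.existing_features
  ADD CONSTRAINT existing_features_pk PRIMARY KEY (customer_id)
""")
```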
MLflow models trained on Databricks using features automatically capture the lineage to the features used in model training. This lineage is stored as a feature_spec.yaml artifact within the model, which addresses a long-standing pain point: users no longer need to independently maintain a mapping between models and features. Inference systems can use this specification and feature metadata for model scoring. Additionally, this information can be used by lineage graphing systems to show all the features required for a model, as well as forward links from a feature to all the models that use it.
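Roughly, a training run that captures this lineage might look like the sketch below. It reuses the hypothetical customer_features table from the earlier examples and assumes a placeholder labels table, main.ml.training_labels, with customer_id and label columns.

```python
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

fe = FeatureEngineeringClient()

# Join labels with features from the feature table by primary key.
training_set = fe.create_training_set(
    df=spark.table("main.ml.training_labels"),  # placeholder labels table
    feature_lookups=[
        FeatureLookup(
            table_name="main.ml.customer_features",
            lookup_key="customer_id",
        )
    ],
    label="label",
)

train_df = training_set.load_df().toPandas()
X = train_df.drop(columns=["label", "customer_id"])
model = LogisticRegression().fit(X, train_df["label"])

# log_model records the feature_spec.yaml artifact that captures which
# features this model was trained on.
fe.log_model(
    model=model,
    artifact_path="model",
    flavor=mlflow.sklearn,
    training_set=training_set,
    registered_model_name="main.ml.customer_churn",  # placeholder name
)
```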
When models are deployed in Databricks Model Serving, the system uses lineage to track the features required for inference and uses the appropriate online table in the Lakehouse to serve features. This simplifies the code an MLOps engineer needs to write for model scoring: they only need to call the model serving endpoint with the necessary IDs, and features are automatically looked up. Furthermore, since models, features, and other data assets all live in Unity Catalog, access to these assets follows the same enterprise governance.
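As a sketch, a scoring request then carries only the lookup keys; the endpoint name, workspace URL, and key values below are placeholders, and a personal access token is assumed to be available in the DATABRICKS_TOKEN environment variable.

```python
import os
import requests

token = os.environ["DATABRICKS_TOKEN"]  # assumed personal access token

response = requests.post(
    "https://<workspace-url>/serving-endpoints/customer-churn/invocations",
    headers={"Authorization": f"Bearer {token}"},
    # Only the lookup key is sent; feature values are retrieved
    # automatically from the online table at scoring time.
    json={"dataframe_records": [{"customer_id": 42}]},
)
print(response.json())
```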
Data scientists can find all the features created using Databricks Feature Store APIs or other ELT frameworks and SDKs. You can select a specific catalog from Unity Catalog to list all Delta tables with primary keys. However, user tags simplify this curation and discovery journey and address a variety of use cases.
Unity Catalog discovery tags can be applied across catalogs and schemas. Users can apply these tags to different entities such as tables, views, models, and functions. Additional guidelines for exploring user tags in Unity Catalog are available for AWS, Azure, and GCP.
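For example, tagging the hypothetical feature table from the earlier sketches might look like this; the tag keys and values are illustrative.

```python
# Sketch: tagging a feature table so it is easy to curate and discover.
spark.sql("""
  ALTER TABLE main.ml.customer_features
  SET TAGS ('domain' = 'customer', 'quality' = 'gold')
""")
```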
You can discover new features in the Lakehouse by clicking on the Features button under Machine Learning in the left navigation. By selecting a catalog you can see all the existing tables you can use as features to train ML models.
To get started, follow the Feature Engineering in Unity Catalog documentation available for AWS, Azure, and GCP. You can also get started with this end-to-end notebook.