The ‘feature store’ is an emerging concept in data architecture, motivated by the challenge of productionizing ML applications. Rapid iteration in experimental, data-driven research applications creates new challenges for data management and application deployment, and production ML pipelines with interdependent modeling and featurization stages complicate these challenges further. Large tech companies have published popular reference architectures for ‘feature stores’ that address some of these challenges, and an active open-source ecosystem provides a full workbench of power tools. Still, the abstract role of the feature store can be a barrier to implementation. We demonstrate an implementation of a feature store as an orchestration engine for a mesh of ML pipeline stages using Spark and MLflow, a broader role than that of a metadata repository for feature discovery. The metadata in a feature store lets us reduce the unit of deployment to the individual ML pipeline stage, which breaks the anti-pattern of ‘clone and own’ ML pipelines. We isolate the concerns of pipeline orchestration and provide tooling for deployment management, A/B testing, discovery, telemetry, and governance. We provide novel algorithms for pipeline stage orchestration, data models for feature stage metadata, and concrete system designs you can use to create a similar feature store using open-source tools.
Nate is a Data Architecture and ML Engineering consultant at Accenture. He leads the design and technical delivery of complex ML applications. Drawing on his background in productionizing research applications, he helps enterprise clients develop a playbook for moving from promising research results to high-value industrialized deployments.