Rethinking Feature Stores

May 27, 2021 04:25 PM (PT)

Feature stores have emerged as a key component in the modern machine learning stack. They solve some of the toughest challenges in data for machine learning, namely feature computation, storage, validation, serving, and reuse.

However, the deployment of feature stores still requires a coordinated effort from multiple teams, comes with a large infrastructural footprint, and leads to integration costs and significant operational overheads. This large investment places feature stores completely out of reach for the average data team. What’s needed is a fundamental redesign of the feature store.

In this talk we will introduce a new lightweight feature store framework that allows any data source to be operationalized by declaring it as a dependency of production ML applications, without coupling those applications to environment-specific infrastructure. By publishing model-centric logical feature definitions, this framework allows data scientists to build ML applications that depend on any data source, using their tools of choice, and deploy to their existing production infrastructure.

In this talk we will also demonstrate how this new paradigm empowers individual data scientists to develop and serve a production-grade ML application in less than one minute.


 

Transcript

Mike Del Balso: Okay, we’re going to talk about productionizing machine learning with feature stores. Also joining me on this talk is Willem, who’s the creator of Feast, an open-source feature store, and a tech lead at Tecton. So, where did feature stores come from? Well, we worked on and built a machine learning platform at Uber called Michelangelo. That platform supports all the machine learning that runs in production at Uber, thousands of real-time models for things like fraud detection, ETAs, stuff like that. And this platform is really three parts. There’s a Feature Store layer, there’s a Model Management layer, and then there’s this Observability and Model Monitoring layer. And the Feature Store was really like a hub, a central hub, for managing the data flows in the ML application, and allowed teams to iterate really quickly on features, to transform raw data into ML features and deploy them to production quickly.
And this Feature Store became the foundation for all the ML systems at Uber. Today, feature stores are a whole category. We left Uber to create Tecton. Tecton is a fully managed enterprise feature store. Feast is another great feature store option; it’s an open-source feature store. Today, we’re going to talk a little bit about both of these. There’s a new release of Feast that we’re going to talk about and show a demo of at the end of this presentation. But overall, today, we’re going to talk about why it’s so hard to productionize machine learning and how a feature store helps with a lot of these data engineering challenges. We’re also going to talk about what’s inside a feature store, how they work, and how they’re relevant for core MLOps workflows. And then, at the end, we have an announcement of a new release of Feast. We’re going to talk about its key design principles and show a nice demo of it.
So, let’s talk about why productionizing machine learning is so hard on modern data stacks. First, let’s talk about what defines a modern data stack. Modern data stacks are really built around these cloud-native data platforms. So, this is your cloud data warehouse or a next-generation data lake. And they have risen in popularity, dramatically, over the past few years. We love our data warehouses. And they’re great because modern data platforms finally offer teams good ways to centralize their business data, good ways to reliably clean that data, aggregate it into useful formats, refine it into much higher-value versions of that data that can be used for analytics, and share that data with their team for analytic purposes. So, these systems have really enabled analytics to be self-serve, to be near zero maintenance, and to be really scalable.
But more than that, they’ve essentially revolutionized the analyst role, right? They unlocked insane productivity for analysts. But what about machine learning? What have they done for ML? Not quite the same thing. So, let’s just look at what is required to build a machine learning system. With ML, of course, we have to train our model. And we can do that with a data science environment, say a Jupyter notebook and an ML framework. But it’s not just that, we also have to deploy our model. And we deploy our model into a different environment. Our models run in production. And imagine this is a real-time model that’s doing fraud detection or recommendations. So, it needs fresh access to feature data to make its predictions. So, both model training and model serving need access to feature data, in different environments. And what’s really important and unique about machine learning is that it just doesn’t work unless both sides get the same, consistent data.
So, how do we get this data into our training environment for model building? Well, ideally, we can use the same data, the same high-value data that we have in our data warehouse or data lake, that our analytics teams have been refining, cleaning, and making much more valuable. We’ve already put a lot of energy and investment into that. However, this is not always super easy, right? There’s a variety of reasons for this, and we’ll just jump into a couple of them here. One is that the data my model can use for predictions in production is not always available in my data warehouse. It could be that, for security reasons, production data doesn’t go there. Another reason is that the data that does live in the warehouse has often been transformed. It’s been cleaned in some way.
For example, it’s been sanitized, typecast, downsampled, reordered, and it’s not necessarily representative of what my model would see in production. So, it introduces some data consistency issues. And let’s also think about the data that my model would use for predictions in production. Well, maybe we can use the same data warehouse for that. This also isn’t easy. It’s actually even harder. These systems don’t have real-time serving. They don’t have good streaming support. They can’t do real-time transforms or transactional features. And what’s an even larger issue for important models is that the engineering teams, the production teams that run your production fraud model, don’t really trust taking a dependency on an analytic data system, like a warehouse, that has really minimal governance on changes.
And so, they don’t want to make their models, their production systems, easier to break, because these people are going to be on call for this thing. So, there’s just a lot of complexity with this data. And there’s always the question of: what data can I use? When? And where is that data? And again, the challenge isn’t just connecting some offline data to a model in production. It’s managing the complexity of ensuring that data is consistent across environments. For machine learning teams to be fast and effective, they can’t be dealing with this complexity in every single iteration [inaudible], they need tools and workflows to be able to iterate quickly on features and have them be easily available for training and production inference. Just remember that the correctness of these ML systems, them functioning at all, depends on getting the same data inputs across these environments.
So, dealing with all of this complexity is the problem in productionizing ML. And it’s why it can take teams months, quarters, sometimes even more than a year, to get a new model into production. This is a huge topic. There’s a bunch of different problems here. In the real world, teams fail for a variety of different reasons. Let’s just look at one of the common workarounds for one of these problems as an example, which might resonate with you; you might see this on your team. One thing that’s quite common for teams to do is to rebuild the offline pipelines online. So, they’ll take some transformations that have been built in a warehouse or some analytic environment, and then they’ll rebuild them in a streaming system or in some operational ETL environment. This doesn’t solve our problems. It allows data to get to these models, but it’s super slow and painful, engineering teams have to be heavily involved in every single iteration, and it’s very error prone.
The data across these systems is now going through two different transformation pipelines, so there are no guarantees of data consistency, and it always leads to maintenance messes. And the biggest problem here is that data scientists don’t end up owning their work in production. They don’t even know how the data that is being consumed by their model in production got generated. So, it’s a tricky situation. So, how can we make productionizing machine learning as fast and easy as building something like a dashboard on a warehouse? Feature stores are built exactly for this, right? They’re the hub for the data flows in a machine learning application. And they’re focused on solving these data complexities for putting ML applications into production. So, they do a bunch of things like ensuring transformations are consistently applied across environments, organizing data for use in ML, making feature data accessible for online inference and offline training, and monitoring this data and keeping it validated.
The most important thing, though, is offering a very simple workflow for teams to quickly iterate between development and production with their data. So, the feature store does this not by duplicating your data stack, but by integrating with it and extending it to provide data workflows and interfaces for ML developers. We believe that the workflow implications of feature stores are more impactful than those of any other component in the ML stack. And they’re critical if we want to make ML as easy as common analytic workflows are today.
So, let’s talk about what’s in a feature store and how it works. Well, let’s draw this out, right? So, of course, the feature store needs to serve features in production. So we have your production application; it requests features, and the feature store delivers those features to a model to make a prediction. The feature store also needs to interact with your development environment. So, from your notebook or your data science environment, you need to be able to talk to the feature store and ask for a training dataset, right? Ask for different features so you can build your models. And the feature store, of course, is connected to your upstream data: stream sources, batch sources. This can also be operational sources like operational databases, even third-party data sources, commonly vendor APIs. So, what’s going on within the feature store?
Well, there’s a couple of main capabilities. First is serving. So, it’s all about delivering feature data to your model. ML models require a consistent view of feature data across training and serving. When we’re creating training data, we need to train on historical examples. And so, it’s actually pretty common to not only need to join features together to generate a training dataset, but also to need historical values of features that represent what a feature’s value was at a specific point in time in the past. This is a really huge concept and one that we could do a whole deep dive on. And then, on the serving side, your models need this feature data fresh, they need it in real time, and they need it served at scale.
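To make that point-in-time idea concrete, here is a tiny, self-contained sketch of such a join. It uses plain pandas rather than any particular feature store, and the driver IDs, timestamps, and feature names are made up:

```python
# Illustrative only: a point-in-time join sketched with pandas, not how any
# particular feature store implements it. All values below are made up.
import pandas as pd

# Historical label events: which driver we asked about, and when.
labels = pd.DataFrame({
    "driver_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2021-04-12 10:00", "2021-04-12 11:30"]),
    "trip_completed": [1, 0],
})

# Historical feature values, each tagged with the time it became known.
features = pd.DataFrame({
    "driver_id": [1001, 1001, 1002],
    "event_timestamp": pd.to_datetime(
        ["2021-04-12 08:00", "2021-04-12 09:00", "2021-04-12 09:00"]),
    "conv_rate": [0.52, 0.61, 0.43],
})

# For each label row, take the latest feature value at or before the label's
# timestamp, so training never sees data "from the future" (no leakage).
training_df = pd.merge_asof(
    labels.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="driver_id",
)
print(training_df)
```

A feature store does this kind of join for you, across many feature tables and at scale, rather than you hand-rolling it per model.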
So, where does the feature store get this data from? Well, it has a storage layer. Feature stores contain both an online storage layer and an offline storage layer. The online storage layer contains the freshest values of each feature and powers that real-time serving to your model for inference. And the offline storage layer contains all the historical values of different features, such that you can go back in time for training, for assembling training datasets that represent historical examples. And feature stores organize the data in these storage layers in a really nice way to make features really easy to join together. So, they’re organized by a primary key, an entity ID, like a user ID for features that are user features, and they’re all tagged with timestamps, so we can go through different historical values. And then feature stores actually run the feature engineering transformations to convert raw data into feature data.
This happens by orchestrating transformation jobs on your existing infrastructure. So, think about running a SQL query on your warehouse or running a job on your Spark cluster. Many of these feature pipelines are expensive, and to compute a feature you pre-compute it, you compute it ahead of time, before you need it for a real-time prediction. And the feature store handles two things: running those pre-computations, but also handling smart backfills for features. So, when you’re iterating on new features, you’re doing feature engineering as you’re experimenting with new features and building new models, trying new ideas. You need historical values of features, and the feature store handles automatically backfilling those values, so you can easily get a new feature set. Another really important capability for this hub of data flows for your ML application is monitoring, obviously. Ensuring data quality and validating data is extremely important to anybody running a production ML application.
So, the feature store is the interface between your models and your data, and it organizes all the data that is important for model operation, but also for model debugging and auditing. So, feature stores not only do data quality monitoring, but they’re also pluggable with external monitoring or data observability systems. It’s really about organizing these datasets, making all of the training data and the serving data accessible to any other system that wants to generate metrics and validate data quality. Feature stores can be extended, not only to monitor data quality, but also for operational metrics, like serving latencies and a variety of other things around feature computation. The final part here is the registry. The registry is the single interface for users to interact with, and the single source of truth about features in an organization. For users, it contains all of the definitions and the metadata that can be used to discover new features and share and reuse features. All the data in the registry configures the feature store system’s behavior.
So, automated jobs use the registry to figure out when to ingest data, when transformations should happen, what data to store, and how to organize it. All of this together allows the feature store to become almost a data catalog of production-ready signals. You suddenly have this library of signals that have already been found useful by other people, and they’re ready to go into production right now. What’s great about this is that it removes the cold start problem for data science projects, for testing new data science ideas. Now you can essentially take a new idea, create a model with a set of previously existing features that others have already found useful, and put it into a full-scale production experiment in minutes. And so, this dramatically speeds up the data science experimentation cycle and really solves this cold start problem.
So, looking at the feature store as a whole, it contains a variety of components that work well together to provide a nice platform, a nice hub, for all the data flows in your ML application. And it solves a lot of the data engineering challenges involved in building and maintaining these production ML applications: transformations, storage, serving these features, and then monitoring them to ensure they’re correct. And then the layer of discoverability and governance that’s really important for sharing things across organizations. So, there’s a couple of things that are important to know if you’re considering using a feature store. The first is that modern feature stores are really lightweight. We’re not talking about adopting a whole duplicate data stack. Feature stores aren’t introducing a new type of database or a custom, reimplemented key-value store. They run in your cloud environment, they reuse your existing data stack, and they extend your existing infrastructure.
So, they reuse your Spark cluster, they reuse your warehouse, and they’ll even reuse your storage layer, so features will be stored in your warehouse or in your data lake, and you can access them for other use cases beyond just feature retrieval for machine learning as well. Getting benefit from a feature store is not an all-or-nothing proposition. Feature stores are incrementally adoptable. So, what does this mean? Well, you probably already have some existing feature pipelines, right? So here we’re looking at two existing pipelines, one that powers your training and one that computes some features for your serving environment. And to get started with the feature store, you probably don’t want to have to rewrite these into the feature store or do a big migration. You’re really just trying to add some new signals to your model, and the feature store is all about that.
So, it’s really an incrementally adoptable system. You don’t need to rewrite your existing pipelines to begin using a feature store; they’re built to work alongside the existing infrastructure that you’re already happy with, that’s already built and working. So, it’s common for teams to start using feature stores just to get the new features they’ve always wanted into production, without dealing with migrating a bunch of old stuff that happens to be already working. And another thing that’s quite common is that after teams get to this point, they actually really like the feature store interface. So, they want to integrate their existing pipelines with the feature store, so they can take advantage of that unified feature retrieval interface the feature store provides. And so, they end up connecting their existing pipelines directly to the feature store and having a single interface that they connect to, to retrieve their features across all of their models.
So, we just talked about a lot of different data engineering problems that need to be solved when putting machine learning into production. These are problems that MLOps teams are spending a ton of time on. But there are elegant solutions out there, and we want to help all ML teams have success here and do this well. So, we’re working on a series of contributions to the open-source feature store called Feast, to establish a really high-quality open feature store that can be used as a reference feature store architecture for ML teams across the industry. So, I’m now going to hand it over to Willem, the creator of Feast, who’s going to introduce the new release of Feast and show us a nice demo.

Willem Pienaar: Thanks, Mike. Hello, everybody. Let me introduce Feast 0.10 to you all. So, the feature store principles and capabilities that Mike just presented are part of a broader vision we have for feature stores. Feast 0.10 is the first step towards that vision, but it’s specifically focused on serving. Our goal with Feast 0.10 was to make the simplest and fastest feature store for ML teams to serve their analytic data in production. Feast 0.10 provides training dataset building with point-in-time correctness, helping you avoid feature leakage, online serving of your data at low latency, and consistency across these two environments. This first release is GCP focused, meaning you get BigQuery as your offline store and Firestore as your online store. The key problem we’re trying to solve with Feast 0.10 is: how can we make it easy for teams to deal with the complexity of data consistency across environments?
So, let me walk you through some of the key changes in Feast 0.10. Firstly, zero config: with Feast 0.10, you can deploy a feature store without any custom configuration, meaning you can just get started in seconds. Secondly, local mode: Feast 0.10 ships with a full local mode, meaning you can test your end-to-end development workflows completely locally from either your IDE or a notebook. Third, there’s no infra involved when you’re just getting started. That means no Kubernetes, no serving API, no Spark. If you want to use Kubernetes, you’re welcome to use the open-source components that we provide; that’s always an option if you want flexibility, but it’s not required when you’re getting started. And finally, Feast 0.10 is extensible. It ships with a provider model that allows teams to easily extend Feast for deployment into their own custom stacks.
So, let’s talk about the demo that I’m going to show you in a second. First, we’re going to show how to set up a feature store with Feast. Then we’re going to train a model using features from BigQuery, and we’re going to use Feast to interface with BigQuery. Then we’re going to test the model with the local online store, using SQLite. And then we’re going to test the model with the production feature store. So, we’re going to deploy Feast to the cloud, to GCP, and we’re going to use Firestore for online serving.
So, here’s the use case. We are going to receive a list of driver IDs, and out of these drivers, we need to pick the driver that is most likely to complete a trip with a customer. And the way we want to do that is to train a model, and then we use that model to do [inaudible], but we have a requirement that we need to respond within a few milliseconds. So, ideally, this model can be served online with a source of data that is both fresh and can be served at low latency. We know that we have a lot of data in BigQuery for drivers, so let’s have a quick look at that. We run a quick BigQuery query, and we should get results back in a sec. Right. So, now we can see that we have a bunch of data in the offline store, in BigQuery, on drivers: conversion rate, acceptance rate, and average daily trips.
We can use this to train a model. So, we know that we have that available. So, as the first step, let’s create a Feast repository in order to set up our feature store. We run `feast init`, we use a GCP template, and we create a driver ranking repository. So, that’s created, and if we have a look at the top left here, you’ll see that we have a new folder with two files. The first file is the feature store configuration. This is a simple little file, and it just tells Feast: you’re going to use driver_ranking as your project name, which is just a namespacing key; the registry is going to be a local file, just a local DB file, where your feature definitions will be stored; and we’re going to use GCP as the provider, which means it targets GCP for deployment. But we don’t want to get there yet.
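For reference, the generated config (feature_store.yaml in Feast 0.10) looks roughly like this sketch; the registry path shown is illustrative:

```yaml
# Sketch of a Feast 0.10-style feature_store.yaml; paths are illustrative.
project: driver_ranking      # namespacing key for this feature repo
registry: data/registry.db   # local file where feature definitions are stored
provider: local              # start locally; switch to "gcp" to target BigQuery/Firestore
```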
So, we only use local as the provider here to start, because we want to do testing and development locally. Let’s have a look at the driver feature definitions. So, within your repository, you put a bunch of these Python files, and these Python files just contain feature definitions defined as normal Python objects. In this case, we just have a single feature view. The combination of the feature view and this BigQuery source tells Feast where to find features, when it needs to materialize features for training dataset generation or for online serving, how to represent those features, and what the constraints and properties of the data are. So basically, this is just telling Feast: there’s a table with driver features, and we can use it both to build training data and to serve those features online.
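A sketch of what that feature definition file might contain, written against the Feast 0.10-era Python API; the table reference and feature names are assumptions based on the narration, and some parameter names (for example, `input=`) changed in later Feast versions:

```python
# Sketch of a Feast 0.10-era feature definition file (e.g. driver_repo.py).
# Table name, feature names, and exact parameter names are illustrative.
from google.protobuf.duration_pb2 import Duration
from feast import BigQuerySource, Entity, Feature, FeatureView, ValueType

driver = Entity(name="driver_id", value_type=ValueType.INT64,
                description="Driver identifier")

driver_stats_source = BigQuerySource(
    table_ref="my-project.driver_data.driver_hourly_stats",  # assumed table
    event_timestamp_column="datetime",
)

driver_hourly_stats_view = FeatureView(
    name="driver_hourly_stats",
    entities=["driver_id"],
    ttl=Duration(seconds=86400),
    features=[
        Feature(name="conv_rate", dtype=ValueType.FLOAT),
        Feature(name="acc_rate", dtype=ValueType.FLOAT),
        Feature(name="avg_daily_trips", dtype=ValueType.INT64),
    ],
    input=driver_stats_source,  # called batch_source in later Feast versions
)
```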
So, let’s just go into the driver ranking repository and run `feast apply`. What this is going to do is register our features and set up a local online store. Since we set the provider to local, we’ve now got a SQLite online store. It’s empty right now, so let’s load some of that BigQuery data into our online store, so that it’s ready for prediction after we’ve trained our model. And we can just use the `feast materialize-incremental` command to do that. What this command does is query BigQuery, find the latest feature values for each driver, and load them into the online store. So now, we’ve got a registry that’s full of feature definitions, and we’ve got an online store that is ready to serve data. The first thing we wanted to do was set up a feature store, and we’ve done that.
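The commands for this setup step are roughly the following (Feast 0.10-era CLI; the template flag and the end-time format for materialization are assumptions):

```bash
# Scaffold the repo from the GCP template, register definitions, and load
# the latest feature values into the local (SQLite) online store.
feast init driver_ranking -t gcp
cd driver_ranking
feast apply
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
```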
The second thing we want to do is train a model. Right. So, let’s go to our training script. Here, we have a script that basically just does four things. One, it connects to the local feature store. Two, it creates a training dataset. Three, it trains a model. And four, it saves it. So, all you need to do to connect to Feast is create a feature store object and point it at the registry, or at the local repo. That’s what we’ve done here. Then, you just call a single method, get_historical_features, in order to build a training dataset. Feast is going to intelligently pull the training dataset: it’s going to query BigQuery, find those driver features, and join them to the data frame that you provide. In this case, you’re providing an orders data frame whose rows contain a column called trip_completed.
And that just says: did this driver complete a historical trip or not? And so, Feast is going to intelligently join those tables together in a point-in-time correct way. And then we’re going to train a model and save that model. So, let’s just quickly run the script. Right. The key thing to understand here is that Feast can query from any number of feature tables or feature views and join those together. So, all of the columns do not have to occur in a single source table, and it can stitch those tables together in a way where your timestamps do not have to align perfectly. It ensures point-in-time correctness, which prevents feature leakage. Right. So now, we’ve trained our model. And if you look on the left, the driver model binary is available. So, second step done.
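A sketch of what that training script might look like with the Feast 0.10-era API; the repo path, feature names, model choice, and the tiny inline orders dataframe are illustrative:

```python
# Sketch of the training script described above (Feast 0.10-era API).
from datetime import datetime

import pandas as pd
from joblib import dump
from sklearn.linear_model import LogisticRegression

from feast import FeatureStore

# 1. Connect to the local feature store (points at the repo and its registry).
store = FeatureStore(repo_path="driver_ranking/")

# 2. Build a training dataset. Feast queries BigQuery and joins driver
#    features onto each row in a point-in-time correct way. In the demo this
#    entity dataframe comes from an orders table; here it is inlined.
orders = pd.DataFrame({
    "driver_id": [1001, 1002, 1003],
    "event_timestamp": [datetime(2021, 4, 12, 10),
                        datetime(2021, 4, 12, 11),
                        datetime(2021, 4, 12, 12)],
    "trip_completed": [1, 0, 1],
})
training_df = store.get_historical_features(
    entity_df=orders,
    feature_refs=[  # called "features=" in later Feast versions
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
).to_df()

# 3. Train a simple model on the joined features. Depending on the Feast
#    version, these columns may be prefixed with the feature view name.
feature_cols = ["conv_rate", "acc_rate", "avg_daily_trips"]
model = LogisticRegression().fit(training_df[feature_cols],
                                 training_df["trip_completed"])

# 4. Save the model for the serving step.
dump(model, "driver_model.bin")
```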
We trained the model. The third step is to test this model against the local online store. So, we go over to our predict class here, to our predict Python file. What we’ve already set up here is a driver ranking class. What this class typically does, for most teams, is load a model and maybe even load batch data into the class itself, like a static dataset that you just serve in production. For some teams, it connects straight to operational DBs, but that doesn’t ensure consistency. So, what we do here is we’re going to use Feast. When we initialize this class, we’re going to load a model into memory, and then we’re also going to create a feature store object, just like we did for training. Let me show you what prediction looks like.
So, for prediction, remember, we’re going to get a list of driver IDs, right? So, we get that list of driver IDs. Step one, we’re going to read online features from Feast. We take the driver IDs and the same list of features we used for training, we just query those from Feast, and we get back a data frame. And here we’re querying the local SQLite online store, meaning we’re not touching any cloud infrastructure; it’s a completely local feature store. You can iterate quickly, and you can develop and test this before you actually go to prod. And so, you get your data, you make a prediction, you choose the best driver, and you return the best driver ID. So, let’s quickly try that out. We should get a driver ID.
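And a sketch of the prediction class just described, again using the Feast 0.10-era API; the repo path, model file, and feature names mirror the training sketch and are illustrative (column names may be prefixed with the feature view name depending on the Feast version):

```python
# Sketch of the prediction class described above (Feast 0.10-era API).
import pandas as pd
from joblib import load

from feast import FeatureStore

class DriverRankingModel:
    def __init__(self):
        # Load the trained model and connect to the same feature repo.
        self.model = load("driver_model.bin")
        self.store = FeatureStore(repo_path="driver_ranking/")

    def predict(self, driver_ids):
        # 1. Read the freshest feature values from the online store
        #    (SQLite locally; Firestore once the provider is switched to GCP).
        feature_dict = self.store.get_online_features(
            feature_refs=[  # called "features=" in later Feast versions
                "driver_hourly_stats:conv_rate",
                "driver_hourly_stats:acc_rate",
                "driver_hourly_stats:avg_daily_trips",
            ],
            entity_rows=[{"driver_id": d} for d in driver_ids],
        ).to_dict()

        # 2. Score each driver (probability of completing the trip) and
        #    return the ID of the best one.
        df = pd.DataFrame(feature_dict)
        df["score"] = self.model.predict_proba(
            df[["conv_rate", "acc_rate", "avg_daily_trips"]])[:, 1]
        return df.loc[df["score"].idxmax(), "driver_id"]

print(DriverRankingModel().predict([1001, 1002, 1003]))
```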
Yep. Driver 1001 is the best driver for completing this trip. Right. But let’s go to prod. We don’t want to use local forever. And the easiest way to go to prod is you just change your provider to GCP, and then you reapply the changes with Feast. So now, what Feast is going to do is update your infrastructure based on the feature views and feature definitions that you’ve registered. In this case, the default online store for GCP is Firestore. So, now we’re going to scaffold Firestore, ready to receive data and ready to serve data. But there’s no data in there yet. So, let’s materialize data from our BigQuery source, our data source, into the online store. So, now we’re just synchronizing that: we’re taking the driver hourly stats and loading them into Firestore. And this’ll just take a few seconds.
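The switch to production is roughly the following; the sed one-liner is just an illustrative way to flip the provider, and editing feature_store.yaml by hand works just as well:

```bash
# Going to prod: flip the provider from "local" to "gcp", then re-apply the
# definitions and load the latest feature values into Firestore.
sed -i 's/provider: local/provider: gcp/' feature_store.yaml
feast apply
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
```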
Yep. Done. Okay. So, let’s rerun our prediction routine. The cool thing now is that we don’t need to change our online prediction routine. We can just rerun it, because even though the online store has changed, nothing about the prediction script has to change. And we get 1001, the exact same result we got locally, and now we have a production store. So, just as a recap: we set up a feature store, we trained a model using Feast, querying data from BigQuery in a point-in-time correct way, we used the local SQLite store to quickly test the online serving capabilities, and then we shipped Feast into production. And now, you can just ship your model into production, and you have a complete production setup with almost no infrastructure that you have to maintain. It’s just a bucket and a serverless Firestore.
So, what’s next? Tecton, just to recap, is an enterprise feature store with advanced functionality like transformations and access control; it has a UI, and it’s a managed service with production SLAs. Feast, on the other hand, is a completely open-source, production-grade feature store that you can get started with today and deploy to your own infrastructure. We’re working towards a common standard for feature stores, and the release of Feast 0.10 is the first step towards that.
Over time, we will be converging our APIs and working towards a unified specification for feature definitions. So, what does that mean? It means that over time, Tecton will be releasing more and more of its internal technologies into the Feast project, and we’ll have the possibility of being able to seamlessly move from Feast into Tecton’s [inaudible] offering. So, if you want to try Feast 0.10, check out our quickstart or have a look at our documentation. If you want to run Feast on Kubernetes with Spark, that’s also available; have a look at our Helm charts and our installation guides. And if a full-featured managed feature store is what you’re looking for, have a look at tecton.ai and reach out to us.

Willem Pienaar

Willem is a tech lead at Tecton where he currently leads open source development for Feast, the open source feature store. Willem previously led the data science platform team at GOJEK, working on the...

Mike Del Balso

Mike Del Balso is the co-founder of Tecton.ai, where he is focused on building next-generation data infrastructure for Operational ML. Before Tecton.ai, Mike was the PM lead for the Uber Michelangelo ...