If you’ve brought two or more ML models into production, you know the struggle that comes from managing multiple data sets, feature engineering pipelines, and models. This talk will propose a whole new approach to MLOps that allows you to successfully scale your models, without increasing latency, by merging a database, a feature store, and machine learning.
Splice Machine is a hybrid (HTAP) database built upon HBase and Spark. The database powers a one-of-a-kind single-engine feature store, as well as the deployment of ML models as tables inside the database. A simple JDBC connection means Splice Machine can be used with any model ops environment, such as Databricks.
The HBase side allows us to serve features to deployed ML models, and generate ML predictions, in milliseconds. Our unique Spark engine allows us to generate complex training sets, as well as ML predictions on petabytes of data.
In this talk, Monte will discuss how his experience running the AI lab at NASA, and as CEO of Red Pepper, Blue Martini Software and Rocket Fuel, led him to create Splice Machine. Jack will give a quick demonstration of how it all works.
Monte Zweben: Hi, I’m Monte Zweben, CEO and co-founder of Splice Machine. And I’m here with my colleague, Jack Ploshnick, to talk about unified machine learning ops: feature stores, model deployment, and how these come together to really scale machine learning. So let’s take a look at our organization for the talk. We’ll talk about the goals of production machine learning, why these are hard to achieve, how a feature store makes it easier, what the feature store landscape looks like, what a fresh approach to feature stores is, and how a new approach to deployment with feature stores can really facilitate the scaling of machine learning models in production. So let’s take a step back and look at the real-time machine learning components. What’s necessary in a machine learning stack to support real-time applications? Clearly you need modeling tools like notebooks and experimentation tools like MLflow to keep track of your experiments.
Deployment mechanisms, often endpoint deployment mechanisms to push containers out that may be making real-time predictions. And underneath this is the new component of a feature store that helps with reusability, governance, and serving features, all built on scale-out data platforms. Some of the data platforms are necessary for operational workloads, others for analytical workloads. And if you look at the landscape today, some of these are very well taken care of by many vendors in the marketplace, including Databricks and the Spark community with modeling notebooks and MLflow. But where there really is a gap is the deployment of real-time models and the ability to serve and train those models in a very scalable fashion. And that’s what we’re going to talk about today. So looking at the typical machine learning infrastructure, you usually see teams of data scientists essentially building bespoke pipelines, meaning, for each model, and perhaps even for each feature in their model, they’re building SQL code, perhaps some PySpark code, to facilitate the construction of their model, but their colleagues are doing the same thing.
And what ends up happening is that there’s a large set of pipelines that are constantly being executed, extracting data from data sources on the left-hand side, from data warehouses, maybe application databases, perhaps real-time events that are taking place in web and mobile applications, and building these pipelines over and over again. And this duplication leads to problems. It leads to higher computing costs because there’s so much duplication going on in these pipelines. On the right-hand side is this duplicative recreation of features that might be slightly off. One data scientist may have a certain summation or aggregation of sales order revenues and another one has the same thing, but one included tax, the other one didn’t, and these are not quite the same.
This is not only duplicative, but partially inconsistent. On the lower left-hand side, lost signal is a hidden problem with this bespoke pipeline approach to building real-time models, for a few reasons: one, you may not be getting the best features, ones that have been fully vetted from both a computational performance and a predictive performance perspective; and two, data leakage, where you’re training models on data that wouldn’t actually have been available at prediction time because of the way you’re extracting features for your training examples. We’ll talk about that.
And then lastly, on the lower right, being able to go back from a prediction made in a real-time model all the way back to the features used in that prediction, back again to what was used to train the model, back again to the state of the data platform that sourced that data and was the source of those pipelines. This is hard, this is really hard for data scientists to keep track of, and a feature store can actually serve to greatly facilitate that. So let’s talk about what a feature store is and how data scientists and data engineers can use it. So what is a feature store? A feature store takes in data, either real-time event data or batch data from data warehouses and databases, and serves as a single source of truth and a repository that facilitates four main functions.
Being able to search for features, being able to create training sets, serving features, and governance. And we’ll talk more about these functions individually in a moment. But the final picture is this: unlike the duplicative bespoke pipelines that we saw before, you get a cleaner source-to-destination flow using a feature store. Now you can build pipelines that extract data from data warehouses, databases, and real-time events, bring that into the feature store, and reuse pipelines over and over again in models and even in dashboards that may not be using machine learning but are simply using these curated features that are now available for the whole enterprise to use. So what are the requirements of a feature store? There are many. Just to call out a few, the most important and, I think, most challenging element of a feature store is scalability.
Being able to handle features on billions of business entities and to handle wide feature vectors, perhaps tens of thousands of features, has proven to be very challenging as companies move from the beginnings of their journey in machine learning to having 10, 20, 50, or 100 models in production. And being able to serve feature vectors, even wide feature vectors, in the millisecond timeframe has proven to be very difficult. And one of the most important requirements is point-in-time consistency. And we’ll talk a little bit about that nuance of building training sets that are aligned on time series data. So let’s look at the feature store landscape today. Existing architectures for many feature stores take raw data and either ingest that raw data into a key-value store, perhaps in a streaming fashion, or in a more batch fashion into an analytics engine, and then use both of these computational stores to serve either training sets or models in real time.
And of course, when you have two different computational engines to extract from, it’s kind of hard to keep them in sync, and in fact in existing architectures these do get out of sync, and it becomes difficult to maintain the training sets and the real-time feature serving. So one approach to solving this problem architecturally with feature stores is to use an HTAP database as a single computational engine underneath the feature store. HTAP stands for hybrid transactional and analytical platform, meaning it can handle very low-latency record lookups and range scans as well as large analytical pipelines. So you can literally have one engine in the store that can be used to find features, to serve features, and to build training sets. So this becomes a cleaner, more consistent architecture than the previous approaches to feature store architectures. But HTAP databases have proven to be difficult in the past.
Some of the challenges of using HTAP databases: many of them have required the full datasets to be in memory, which is a limitation from both a cost and a scalability perspective. Some HTAP databases require specialized hardware that is essentially architected and engineered for the dual workloads. Some don’t support the full ANSI SQL environment, like having secondary indexes when you want to attack a table from multiple dimensions and still provide low-latency lookups, or triggers to be able to trigger functionality upon the changes that are made in the database. And then lastly, many of the HTAP databases are not ACID compliant but only eventually consistent, where the analytical workloads and the transactional workloads don’t fully adhere to the principles of being atomic, consistent, isolated, and durable.
But the approach that we’re going to show today utilizes an ACID-compliant HTAP database to get past some of these limitations. The ideas behind this underlying data engine are these: first, it’s scale-out, so data is distributed in a shared-nothing manner across multiple nodes. It can be deployed on any kind of infrastructure, whether that’s any cloud, including Google, Azure, and Amazon, or on premises. It supports full ANSI SQL, with the ability to have secondary indexes and triggers, and is fully ACID compliant. And the way it does so is by introspecting any SQL statement, looking at the estimated cost of that statement, and making a decision on the fly: is this going to be a short low-latency statement that may be doing a single record lookup or short range scan? In that case the query gets compiled, after being optimized, into byte code and executed on a very fast key-value store; we use Apache HBase under the covers.
You don’t have to manage the Apache HBase engine in this SQL store; you actually get to use its fast, low-latency operations automatically without having to manage it. On the other hand, if you’re doing a long training pipeline, a feature engineering pipeline, and you’ve got lots of joins and aggregations and groupings in your statement, the cost-based optimizer will compile that query into byte code but execute it on Apache Spark. So this gives you the best of both worlds. So the way we use this implementation of an HTAP database for a feature store implementation is the following. For any grouping of features, we call that a feature set, and that feature set is usually describing some business entity, let’s say, for example, a customer. That customer is going to have a primary key so that we can look up the features for that customer really quickly; using this HTAP database with a primary key structure, we can look up very quickly the record for a business entity.
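The routing decision Monte describes can be sketched in a few lines. This is a toy illustration of a cost-based optimizer choosing between the OLTP and OLAP execution paths, not Splice Machine’s actual optimizer; the cost proxy and threshold are made up for the example:

```python
# Toy sketch of an HTAP cost-based routing decision: cheap point
# lookups go to the key-value engine (HBase), expensive analytical
# queries go to Spark. Cost model and threshold are illustrative.

def estimate_cost(query_plan):
    """Very rough cost proxy: rows touched times join count."""
    return query_plan["rows_scanned"] * max(1, query_plan["num_joins"])

def route(query_plan, oltp_threshold=10_000):
    """Return which engine should execute the compiled byte code."""
    if estimate_cost(query_plan) < oltp_threshold:
        return "hbase"   # low-latency point lookup or short range scan
    return "spark"       # large joins, aggregations, training pipelines

point_lookup = {"rows_scanned": 1, "num_joins": 0}
training_query = {"rows_scanned": 50_000_000, "num_joins": 4}

print(route(point_lookup))    # hbase
print(route(training_query))  # spark
```

The point of routing on estimated cost rather than query type is that the application issues plain SQL either way; the engine choice is invisible to the caller.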
Now, if you were to change some of the features, you update the feature set table representing a grouping of features, and under the covers a database trigger automatically fires, here in the middle, that bridges real-time feature serving to a transformation of the old values of those features into a time series representation, where there is a primary key maintained and an as-of/until timestamp pair representing the older values of the feature set. And this is all done automatically. And the reason why this is important is that now, on the right-hand side, you have a full version history in time series format of all of your feature changes, and you can align training sets using this history, but on the left-hand side, you have a real-time feature set available to serve models in real time with very fast lookups. And we’ll see this in action in just a few minutes.
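The trigger pattern just described, archiving the old feature values with as-of/until timestamps whenever the live row changes, can be demonstrated end to end using SQLite as a self-contained stand-in. The table names, columns, and timestamps here are illustrative, not Splice Machine’s actual schema or SQL dialect:

```python
import sqlite3

# Live feature set table plus a history table; an AFTER UPDATE
# trigger archives the OLD values with an as-of/until window.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_features (
    customer_id INTEGER PRIMARY KEY,
    total_spend REAL,
    last_update TEXT
);
CREATE TABLE customer_features_history (
    customer_id INTEGER,
    total_spend REAL,
    asof_ts     TEXT,   -- when this value became current
    until_ts    TEXT    -- when it was superseded
);
-- On every update, archive the OLD values into the history table.
CREATE TRIGGER archive_features
AFTER UPDATE ON customer_features
BEGIN
    INSERT INTO customer_features_history
    VALUES (OLD.customer_id, OLD.total_spend,
            OLD.last_update, NEW.last_update);
END;
""")

conn.execute("INSERT INTO customer_features VALUES (1, 100.0, '2021-01-01')")
conn.execute("UPDATE customer_features SET total_spend = 150.0, "
             "last_update = '2021-01-02' WHERE customer_id = 1")

print(conn.execute("SELECT * FROM customer_features_history").fetchall())
# [(1, 100.0, '2021-01-01', '2021-01-02')]
```

The live table stays one row per entity (fast point lookups for serving), while the history table accumulates the time series needed to align training sets.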
Any feature store has a very intuitive API for data scientists. One example is a single API to create training views and training sets, and another API to get an arbitrary feature vector. Part of the job of a feature store, though, is to essentially tie features to models, and the unified MLOps architecture requires the combination of feature stores with a unique approach to model deployment. And one of the most important elements of model deployment is the ability to have a prediction store, so that you have lineage, can monitor for concept drift, and can compare new models to historical models. So taking a look at what we mean by a prediction store here on the right-hand side: imagine for every user who asked a model for a prediction, there was a record in the database that had what model they used, under the run ID, the features served for that particular prediction, and then the prediction that came out of the model.
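A minimal in-memory sketch of the two API calls mentioned, creating a training view and fetching an arbitrary feature vector. The class, method names, and signatures are hypothetical illustrations of the API shape, not the actual Splice Machine SDK:

```python
# Hypothetical feature store API sketch: one call registers a
# training view, another returns a feature vector by entity key.

class FeatureStore:
    def __init__(self):
        self._features = {}   # (entity, key) -> {feature_name: value}
        self._views = {}      # view name -> label SQL and join keys

    def create_training_view(self, name, label_sql, join_keys):
        """Register the label query and keys used to build training sets."""
        self._views[name] = {"sql": label_sql, "join_keys": join_keys}

    def set_features(self, entity, key, values):
        self._features[(entity, key)] = values

    def get_feature_vector(self, entity, key, features):
        """Low-latency lookup of an arbitrary feature vector."""
        row = self._features[(entity, key)]
        return [row[f] for f in features]

fs = FeatureStore()
fs.set_features("customer", 42, {"spend_7d": 310.0, "orders_30d": 4})
fs.create_training_view(
    "purchase_in_session",
    "SELECT customer_id, invoice_id, label, ts FROM labels",  # illustrative
    join_keys=["customer_id"],
)
print(fs.get_feature_vector("customer", 42, ["spend_7d", "orders_30d"]))
# [310.0, 4]
```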
If one had one of these memorialized tables for every model, you could go back and easily track drift and easily monitor and govern your models. So one of the mechanisms in our MLOps architecture, which is a unique approach to deploying models, is something called database deployment. And with database deployment, what we do is take any model and serialize it into the HTAP database. And what happens is that the data scientist, or the machine learning engineer, or just the application developer can simply insert records into this evaluation store.
So they insert records containing a timestamp of when a prediction is going to take place, the model that they want to use, and the features for that prediction. This can be done on a single-record basis or in a big batch. The database triggers associated with this evaluation store, which are automatically constructed, deserialize the model or grab it from a cache, apply those features to the model, and then memorialize the output here at, truly, millisecond speed.
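This insert-triggers-prediction flow can be mocked in plain Python: the `insert` method below plays the role of the auto-generated database trigger, scoring the features and memorializing everything in one row. The model function, run ID, and column names are stand-ins, not real Splice Machine artifacts:

```python
import time

# Toy sketch of "database deployment": inserting into the evaluation
# store implicitly runs the model (as the trigger would) and stores
# the prediction alongside the features that produced it.

def model(features):          # stand-in for the deserialized model
    return 1 if sum(features) > 10 else 0

class EvaluationStore:
    def __init__(self, run_id, model_fn):
        self.run_id = run_id      # which MLflow run produced the model
        self.model_fn = model_fn  # cached, deserialized model
        self.rows = []

    def insert(self, user, features):
        # What the auto-generated trigger does on INSERT: score the
        # features and memorialize who, when, which model, and what.
        self.rows.append({
            "ts": time.time(), "user": user, "run_id": self.run_id,
            "features": features, "prediction": self.model_fn(features),
        })

store = EvaluationStore(run_id="mlflow_run_abc123", model_fn=model)
store.insert("alice", [3, 9])   # single-record insert
store.insert("bob", [1, 2])     # ...or loop for a big batch
print([r["prediction"] for r in store.rows])  # [1, 0]
```

Because every prediction row carries the run ID and the exact features, the same table later supports the lineage and drift monitoring described below.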
So this offers a new approach to being able to deploy and serve models. As opposed to a container approach or an endpoint approach that has a RESTful API only, this is an approach where you can simply use SQL and get the benefits of being able to serve models in real time.
So this now becomes a new unified architecture for MLOps, where a feature store is able both to serve your features in real time and to perform consistent training of your data sets by tracking feature history, and then using database deployment and an ML Manager to take runs out of MLflow and deploy them in the database with this [inaudible], enabling you to essentially get lineage and governance. You can go from the prediction that’s being made in the evaluation store, find out what algorithm was used for that prediction and what features were used, go into MLflow with that run ID and find out exactly what features were there, and then dig into those features in the feature store, literally being able to go back to see exactly how that training set was created in a very repeatable way from the creation of the training set in the feature store’s automated API. So with that, what we’re going to do now is demonstrate this feature store and this database deployment in action. I’ll pass it to Jack.
Jack Ploshnick: Thanks so much, Monte. In this demonstration, I’m going to show you what Monte talked about, how a feature store can be built on a single HTAP database. So here I’m inside of the Splice Machine platform, where you’ll see Jupyter Notebooks, [inaudible] you can track your feature store, but seeing as this is the Databricks conference, I’m going to demonstrate the feature store’s capabilities on the Databricks platform. Certainly the HTAP database can be connected to wherever you’re writing your code: any Jupyter Notebook environment, anything that can accept a JDBC or ODBC connection. Here we’re going to use Databricks. So in this demonstration of the feature store, and of the capabilities of an HTAP feature store specifically, we’re going to build a product recommendation engine. This product recommendation engine is going to have two feature sets. It’s going to have a summary of purchases that customers have made over the past day, week, month, year.
And then we’re going to have a summary of items that are currently inside of a given user’s cart. You can imagine a user adds multiple items to the cart, you’re making a product recommendation, and whatever is in the cart prior to that point is a feature inside of your model. And the training label we’re going to use in this model is: was the product that was recommended purchased in this session? You can imagine wanting to look at a further time horizon, but here we’re just going to stay all within sort of one user session. So what we have here is two different feature sets, and these feature sets are very different in nature. This first feature set is a batch feature set. This is something that you’re going to update daily, or weekly, or monthly. This is the type of operation that’s coming out of your data warehouse.
And with the Splice Machine feature store, this type of operation can be automatically scheduled and managed. What you do is you specify a source. This source can be any SQL database, as well as any flat table; you could connect to a parquet file sitting in S3, anything like that. You bring that data directly into the feature store as a first-class table. Then you have the ability to schedule your transformations on this source. On what start date do you want the transformations to begin? An aggregation window: do you want this transformation to occur every day, every week, every month? And then you can specify a number of arbitrary feature transformations. You might want to take a particular column and calculate a sum of that column every day or week. And of course in this area, you can do a number of different transformations: sums, averages, standard deviations, whatever you need to do.
So this first feature set, a batch feature set, can easily be added into the feature store and populated into those two separate tables that Monte was describing. Building these types of complex aggregations, even if they’re not in real time, even if they’re something that happens just once a week, is not necessarily an easy task. This is the SQL under the covers that the feature store generated in order to populate this batch feature set. This type of complex SQL, this complex aggregation, is what the data scientist would have to write without the feature store. But now you sort of specify some basic SQL aggregations and the feature store takes care of the rest. Additionally, because our HTAP database supports triggers, feature transformations can be event driven. They don’t have to be scheduled to occur every week. They can occur whenever a new row of data, a new observation, is added to the feature store.
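To make the spec-to-generated-SQL idea concrete, here is a toy generator that expands a small declarative spec into windowed aggregation SQL. The spec format and the SQL it emits are illustrative only, not what Splice Machine actually generates:

```python
# Toy generator: expand a declarative feature spec (table, key,
# column, aggregation functions, window lengths) into the kind of
# windowed aggregation SQL a data scientist would otherwise hand-write.

def build_agg_sql(table, key, column, funcs, windows_days):
    selects = [key]
    for f in funcs:
        for w in windows_days:
            selects.append(
                f"{f}(CASE WHEN ts >= CURRENT_DATE - {w} "
                f"THEN {column} END) AS {column}_{f.lower()}_{w}d"
            )
    return (f"SELECT {', '.join(selects)}\n"
            f"FROM {table}\nGROUP BY {key}")

sql = build_agg_sql("purchases", "customer_id", "amount",
                    ["SUM", "AVG"], [1, 7, 30])
print(sql)
```

Even this tiny spec fans out into six windowed aggregates; with many features and windows, generating this SQL rather than hand-writing it is where the time savings come from.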
What I’m doing here is simply specifying an empty feature set inside of the feature store. I’m writing some pretty basic backfill SQL just to populate that table. And then I’m going to write a database trigger. And this trigger fires whenever a new row of data is added to the database. This is how we’re able to build that feature set that uses whatever information is currently inside of a user’s cart. Every time a new item is added, the trigger fires and the feature set is updated. So now that we have our feature sets built, we can talk about the first real use case of the feature store which is building training sets. We have our features, what do we do with them? Building training sets especially when you have a real-time use case is not always as straightforward as it seems.
And the reason for that is what we call the point-in-time correctness problem. Let’s imagine that we want to make a prediction at this time here. We have a given customer ID, we have a given invoice ID, which you can think of as sort of a cart or session ID, we have the number of items currently in a user’s cart, and we have a time that the product recommendation needs to fire. What you need to train our model is all of the most up-to-date features at this particular point in time. From this feature set, we have this timestamp here, which was calculated several days ago. Whereas in this other feature set here, we have this observation when new items were added to the cart, and this was taken just 30 seconds prior to when this observation occurred. These three different timestamps, which are all different still from the time that a purchase might’ve actually been made, are what the data scientist needs to wrangle. They need to combine all of these different timestamps in precisely the right way and ensure that they don’t accidentally have data leakage.
They haven’t accidentally included this row here, for example, which occurred after the prediction would have happened in real life. This is a very challenging task to do, and the feature store does it automatically. How does this work? All the data scientist has to do is specify their training label, some primary keys, and some join keys associated with that training label, and the feature store takes care of the rest. We have this concept of a training view, which operates like a view inside of a SQL database. Whenever this is run, the most up-to-date features are pulled out of the feature store and joined in that point-in-time consistent manner. And they can then be displayed to the data scientist in the form of a Spark or a Pandas data frame.
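The point-in-time join itself can be shown in a few lines of pure Python: for each prediction timestamp, take the most recent feature value at or before that time, and never a later one. The data and column meanings are made up for illustration:

```python
# Minimal point-in-time ("as-of") join: each label row gets the latest
# feature value known at its timestamp; later rows would be leakage.

def asof_join(label_rows, feature_rows):
    """label_rows: [(key, label_ts, label)];
    feature_rows: [(key, ts, value)]."""
    out = []
    for key, label_ts, label in label_rows:
        history = [(ts, v) for k, ts, v in feature_rows
                   if k == key and ts <= label_ts]   # no future rows!
        feature = max(history)[1] if history else None
        out.append((key, label_ts, feature, label))
    return out

features = [  # (customer_id, ts, cart_item_count)
    (1, 100, 2), (1, 130, 3), (1, 200, 5),
]
labels = [(1, 150, "purchased")]  # recommendation fired at t=150

print(asof_join(labels, features))
# [(1, 150, 3, 'purchased')]  -- the later t=200 row is correctly excluded
```

A real feature store does this with generated SQL over the history tables rather than in application code, but the invariant is the same: `feature_ts <= label_ts` for every joined value.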
Under the covers, what have we actually done here? How did this magic of create training view work? What has occurred? The feature store generated this complex SQL statement, which as you can see has a number of inner joins, outer joins, and subqueries. This is the type of work that is necessary every time a data scientist wants to build a point-in-time correct training set in a real-time use case: they have to wrangle all of these different timestamps and join them together. It’s a very time-consuming and error-prone operation, and the feature store does it automatically. Now that we have our training set built, we can go ahead and train our model. In this case, we’ve got a nice little H2O model to do our product recommendation. So now that we have our model trained, it’s time to serve features to the deployed model. How do we get features out of the feature store and send them to where they need to go?
You’ll notice here an API that’s probably very similar to what you’ve seen in other feature stores. We’re able to get features out of the feature store in the form of a Spark data frame, or we can get them in the form of SQL, which can then be executed, and as you can see, we get features out of the feature store in milliseconds. This [inaudible] would work in a real-time use case, of course; you would just parameterize this. You can write a little function to get features out of the feature store based on the primary keys that you need, as you can see, in an incredibly short period of time. But what’s happening under the covers here is very different than what you’ll see in other feature stores, due to our HTAP database. The data that’s used for training and the data that’s used for feature serving are linked with the ACID-compliant triggers that Monte mentioned.
So you never have to worry about inconsistency between the two databases. You never have to worry that the data you use for model training is not going to be available to serve to deployed models. These two tables inside of the feature store are linked with ACID-compliant triggers, and you can use a very simple API to get features out of the feature store in milliseconds. As Monte mentioned, the feature store can operate as an evaluation store as well. The very same database powering the feature store can deploy models. And this is beneficial because it’s much faster: data doesn’t have to move from wherever your feature store is to wherever your containerized REST endpoint is. It’s all self-contained within the same ACID-compliant system. It’s very easy to deploy, there’s not another system you have to manage, and you get a unified data lineage and governance story.
The way this works is we can simply deploy a model using MLflow. In this case, we’re in the Databricks environment; we just take that MLflow run ID and deploy a model to the database. What we’re actually doing under the covers, as you can see in these logs that printed out, is making a new table inside of the database and associating a database trigger, such that whenever a new row of data is added to this table, the prediction is automatically generated and stored in the very same database table. So if we want to take data out of the feature store and make a prediction based on it, we simply do an insert statement from one part of the database to another. We insert the features from the feature store, and the prediction is automatically generated and stored in the same table in milliseconds.
Of course, this doesn’t have to be just a single-record lookup. In this case, we’re doing 19,000 observations, and this happens in just a few seconds. So the feature store with an HTAP database can power your evaluation store and can power your model deployment mechanism, all in the same system. Finally, putting all of these pieces together, we have an automatic lineage and governance story. We can pull out of the feature store the exact training set that was used for the deployed model. We can then pull out all of the predictions that the deployed model has made: who asked for those predictions to be made, when, what model was used to make those predictions, and the predictions themselves. And then we’re able to make a plot that looks like this. We’re able to track the distribution of features that the model was trained on and compare that to the distribution of features that the deployed model has actually seen.
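A minimal sketch of such a drift check: compare the training-time distribution of a feature against what the deployed model has actually seen. A simple standardized mean shift is used here for brevity; a production system might use PSI or KL divergence instead, and the data below is made up:

```python
import statistics

# Drift check sketch: how far has the served feature distribution
# moved from the training distribution, in training standard deviations?

def drift_score(train_values, served_values):
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(served_values) - mu) / sigma

train_spend = [10, 12, 11, 13, 9, 12, 10]   # pulled from the training set
served_spend = [22, 25, 24, 23, 26]          # pulled from the evaluation store

score = drift_score(train_spend, served_spend)
print(score > 3)  # True: this feature has clearly shifted, flag for review
```

Because the evaluation store memorializes every served feature vector with its run ID, this comparison needs nothing beyond two queries against the same database.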
So finally, putting everything together, what Monte and I have talked about today is a complete unified architecture. The HTAP database can modularly fit into whatever architecture you have and solve all of the missing components that are necessary for real-time machine learning. So with that, thank you so much. And we will happily answer any questions in the chat.
Jack Ploshnick is a Customer Data Scientist at Splice Machine. His work focuses on using analytics to support the sales and marketing teams, as well as onboarding new customers. Prior to Splice Machin...
Monte Zweben is the CEO and co-founder of Splice Machine. A technology industry veteran, Monte's early career was spent with the NASA Ames Research Center as the deputy chief of the artificial intelli...