Accelerating the ML Lifecycle with an Enterprise-Grade Feature Store


Productionizing real-time ML models poses unique data engineering challenges for enterprises that are coming from batch-oriented analytics. Enterprise data, which has traditionally been centralized in data warehouses and optimized for BI use cases, must now be transformed into features that provide meaningful predictive signals to our ML models. Enterprises face the operational challenges of deploying these features in production: building the data pipelines, then processing and serving the features to support production models. ML data engineering is a complex and brittle process that can consume upwards of 80% of our data science efforts, all too often grinding ML innovation to a crawl.

Based on our experience building the Uber Michelangelo platform, and currently building next-generation ML infrastructure at Tecton.ai, we’ll share insights on building a feature platform that empowers data scientists to accelerate the delivery of ML applications. Spark and Databricks provide a powerful and massively scalable foundation for data engineering. Building on this foundation, a feature platform extends your data infrastructure to support ML-specific requirements. It enables ML teams to track and share features with a version-control repository, process and curate feature values to maintain a single source of centralized data, and instantly serve features for model training, batch predictions, and real-time predictions.

Atlassian will join us to provide a first-hand perspective from an enterprise that has successfully deployed a feature platform in production. The platform powers real-time, ML-driven personalization and search services for a popular SaaS application.


 


Video Transcript

– Hi, I’m Mike Del Balso, and today we’re gonna talk about accelerating the machine learning lifecycle with an enterprise-grade feature store. Giving this talk with me is Geoff Sims, a principal data scientist at Atlassian; we’ll hear from him later in the talk. Before we get into it, I’d like to mention a little bit about my background. I’ve been involved in ML Ops for a number of years. I was a product manager at Google working on the core ML models that power Google’s Ads business. These models were super important and also had great ML Ops processes around them to keep them accurate, safe, and trustworthy. I later joined Uber, where I helped create the first centralized machine learning team and created the Michelangelo machine learning platform. Today I’m the CEO and co-founder of a company called Tecton, and we’ll talk about Tecton in this talk.

Machine learning today falls short of its potential quite a bit. Many companies are investing big in machine learning, but these efforts just aren’t paying off. There are a couple of reasons for this. There’s limited signal: it’s hard to find the right data and hard to extract the right signal from it. There are really long development cycles: applied ML projects often take more than a year. And there’s no path to production: it’s very painful to get a machine learning project past the finish line into production, and productionizing an ML project can take multiple teams, a lot of resources, and a lot of re-implementation. These pains are particularly pronounced in high-value applied use cases for machine learning, which we call operational machine learning. Operational ML is how we refer to applied ML projects that actually drive user experiences or live business processes in production. Some examples are fraud detection, click-through rate prediction, real-time and dynamic pricing, and customer support. These use cases tend to be really high stakes: they have customer-facing impact, are time sensitive, sometimes have a real-time component and production SLAs, can be subject to regulation, and have a number of different stakeholders.

Building operational machine learning applications is quite complex, and data is at the core of that complexity. This is a now-famous diagram from a paper published by my old team at Google. Its purpose is to illustrate that machine learning code is quite small in the grand scheme of everything that needs to be done to put a machine learning model into production and operate it correctly.

When we look at all of these components around the machine learning code, we can see that data and feature management is a dominant challenge in most of these boxes; it’s the majority of the work that goes into building and deploying these machine learning applications.

So features are the inputs we use for machine learning models. They’re the signal that we extract from our data, and they’re critical to a machine learning application. Many teams invest a ton of time into feature engineering, and that data is extremely valuable; for many industries where models are core to the business, the features themselves are the core IP. Teams that excel with applied machine learning have ways to share and reuse features, and this lets them get a lot of leverage out of features as individual building blocks for machine learning applications. Andrew Ng said that applied machine learning is basically feature engineering, which speaks to the importance here.

Unfortunately, there just aren’t good enough tools to help teams build and manage their features today. When we’re building a machine learning application, we have a healthy ecosystem of DevOps tooling to support the development and deployment of application code, and at the model layer there’s an emerging set of tooling to support model development and model management. This tooling helps us train models and serve models in production.

The big gap today is at the feature layer: tooling is needed to help teams manage the deployment and versioning of these features, share these features as standardized resources, and deploy and operate these feature pipelines in production.

So this is where Tecton comes in. Tecton is a data platform for machine learning. It’s a platform built for managing features as valuable software assets through their complete lifecycle, and it’s intended to be a control plane for the features and the data used in an ML project. At a high level, Tecton connects to the raw data sources of the business via batch, streaming, or real-time data, then transforms that data with feature pipelines and organizes feature values in feature stores.

And then it continually computes fresh feature values and serves those feature values to models in production for inference.

Production models interact with Tecton by fetching feature vectors in production, and data scientists interact with Tecton by defining new features and extracting training data from it. Let’s look at that. When a data scientist wants to build a model, they need training data, and they fetch that training data from the same system through a Python SDK in a notebook, for example a Databricks notebook. When data scientists want to define features, they write the source code for feature pipelines and then roll out that feature configuration to production using the Tecton CLI, which is best combined with a CI/CD pipeline.

Finally, users can browse feature data and metadata for their features in a metadata catalog through a web UI. So this platform is made for data science and engineering teams, especially to give them faster development cycles, lower time to production, lower operational costs, and easier adoption of machine learning across teams.

So the Tecton platform is focused on solving a few key problems. The first is helping teams manage a sprawling and disconnected set of data pipelines for machine learning; solving this problem was a key enabler for scaling machine learning at Uber with Michelangelo. The second is building training data sets from messy data.

Finally, productionizing feature pipelines and deploying them as part of a machine learning application. Let’s start by looking at the first challenge.

Challenge 1: Managing sprawling and disconnected feature transform logic

Models are made with tens, hundreds, and sometimes thousands of features. Each feature has a corresponding data pipeline that connects to the internal data sources and transforms that data into the feature values needed by the model. There’s a lack of standardization in the tools used for these workflows, and it leads to silos between teams. One feature might be implemented as a few lines of code in a Jupyter notebook, not discoverable or reusable by anyone else in the team or the company. When another person building a separate model needs a similar feature, for example the seven-day click count feature, they may have no idea that this feature was previously implemented. That data scientist will then spend a bunch of time building the feature all over again on their own, essentially re-implementing it in their own silo, and possibly on their own tech stack.
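To make the siloed version concrete, here is a minimal sketch, assuming a hypothetical `clicks` events table with `user_id` and `timestamp` columns, of the kind of one-off PySpark cell that ends up buried in a notebook:

```python
# One-off notebook cell computing a seven-day click count per user.
# Assumes a hypothetical events table `clicks` with `user_id` and `timestamp`.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

clicks = spark.table("clicks")
cutoff = F.date_sub(F.current_date(), 7)

seven_day_click_count = (
    clicks
    .where(F.col("timestamp") >= cutoff)        # keep only the trailing seven days
    .groupBy("user_id")
    .agg(F.count("*").alias("click_count_7d"))  # one row per user
)
```

Nothing about a cell like this is discoverable, versioned, or reusable; the next person who needs the same signal has no way to find it.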

And then when another person wants to build, say, a retention model and wants to use a similar feature, maybe they do know; maybe they can find the previously defined feature data or the existing feature pipeline code from the other two implementations. Ideally they could reuse the output of those pipelines, the computed feature values. However, they don’t have any guarantees about these pipelines. First of all, which one should they use? Secondly, can they rely on this pipeline for their ML application? Is the creator of this pipeline planning on changing it next week and therefore sending different data to their model? Are they gonna deprecate this pipeline next week? If this pipeline breaks, who’s on call for it? Will I know how to debug it? It feels safer to just write my own implementation.

And this is how pipeline sprawl happens. Even for stable, productionized pipelines, there aren’t good ways to manage and share them. Features are some of the most highly curated and refined data in a business, yet they are some of the most poorly managed assets. Again, this is just for one feature; different models have hundreds or thousands of features, and this is a core friction that prevents teams from collaborating and companies from scaling machine learning internally.

Solution | Tecton centrally manages features as software assets

So Tecton provides a standardized way to define features and share them across an organization, bringing reusability and best practices to data science development. It takes the approach of managing features as both the feature data and the feature transformation code used to generate it. Tecton allows a user to define feature transformations using different data processing frameworks like PySpark or Python, and persists these implementations centrally for reuse across use cases; think of an organization-wide library of features. Finally, it orchestrates the regular execution of these feature pipelines to generate updated feature values. It’s a simple framework for data scientists to easily author feature logic and have the versioning, dependency management, and monitoring taken care of for them automatically. So how do features actually work in Tecton, and how does a data scientist define them?

Feature transformations in Tecton are written as managed code. Features are defined in Python files in your local code repository. The definitions are made up of the transformation code and other metadata that tell Tecton how to orchestrate the execution of these transformations and share them with others.

Different transformations can be defined for execution on different processing engines; for example, Tecton may execute a PySpark-based feature on Databricks and execute a SQL-based feature within the warehouse itself. When a feature is defined, to save it to the feature store, Tecton has a Terraform-like interface to simply apply the locally defined feature pipeline configuration to the production Tecton cluster. When a feature is in the feature store, it’s fully productionized, and I’ll say more about that shortly. This Terraform-like CLI allows features to be managed like software assets, with standard DevOps workflows and infrastructure like your CI/CD system. So that’s it: manage your feature definitions in git, and sync your feature definitions with the Tecton cluster with a CLI command.
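As a rough sketch of what such a managed feature definition might look like, here is the same seven-day click count expressed as transformation code plus metadata. The decorator, its parameters, and the `feature_platform` import below are illustrative stand-ins, not Tecton’s exact API:

```python
# Illustrative sketch of a feature defined as managed code. The decorator,
# its parameters, and the `feature_platform` module are hypothetical stand-ins.
from datetime import timedelta
from feature_platform import batch_feature_view, clicks_source  # hypothetical

@batch_feature_view(
    sources=[clicks_source],            # registered raw data source
    entities=["user_id"],               # join key for this feature
    mode="pyspark",                     # execute as a PySpark transformation
    batch_schedule=timedelta(days=1),   # how often to recompute feature values
    owner="growth-ds@example.com",      # metadata for discovery and ownership
)
def user_click_count_7d(clicks):
    from pyspark.sql import functions as F
    cutoff = F.date_sub(F.current_date(), 7)
    return (
        clicks
        .where(F.col("timestamp") >= cutoff)
        .groupBy("user_id")
        .agg(F.count("*").alias("click_count_7d"))
    )
```

The transformation body is the same few lines of PySpark as before; what changes is that the schedule, ownership, and join keys travel with it, and the Terraform-like apply command, run locally or from CI/CD, syncs the definition to the cluster.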

When a feature is in the feature store, it’s visible and discoverable through a central UI. The UI groups features and tracks metadata, owners, and access policies, and provides an overview of feature stats. And remember, Tecton doesn’t operate the models themselves, but running the pipelines responsible for delivering feature data to these models allows it to track the end-to-end data lineage from data source to prediction.

The second area of challenges is encountered when we’re trying to build high-quality training data sets from messy data. Common challenges here are around creating a single data set from a variety of pipelines, maybe authored by different people or depending on different upstream data sources; getting historically accurate training data with point-in-time correct values; preventing label leakage; and getting that training data to the various training jobs. Tecton addresses these challenges with a simple API to generate training data.

Solution | Configuration-based training data set generation through simple APIs

So Tecton has configuration-based training data set generation with a simple API. And because feature pipelines in Tecton are standardized and all meet a specific data contract, Tecton can make use of intuitive higher-level abstractions when combining features together to generate a training data set. To bring things together into a training data set, first there’s a simple SDK to configure the features that we wanna use in our model. The CTR model’s features object here represents the group of features that we want to use to generate training data. Then, in an interactive Python environment, for example a Databricks notebook, a single line of code will generate a full historical training data set using features from the feature store.

This function simply takes a list of historical events that we want to generate feature vectors for, like page load events, and returns a Spark DataFrame with the feature and label values that you can use to directly train your model.
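A hedged sketch of what that notebook flow might look like, using hypothetical object and method names rather than Tecton’s exact SDK:

```python
# Illustrative notebook sketch of pulling point-in-time correct training data.
# `get_feature_service` and `get_historical_features` are hypothetical names,
# and `spark` is assumed to be the notebook's SparkSession.
from feature_platform import get_feature_service  # hypothetical import

# The historical events we want feature vectors for: keys, timestamps, labels.
events = spark.table("page_load_events").select(
    "user_id", "event_timestamp", "clicked"
)

ctr_model_features = get_feature_service("ctr_model_features")

# One call joins historically accurate feature values onto each event.
training_df = ctr_model_features.get_historical_features(events)

training_df.show(5)  # a Spark DataFrame ready to feed into model training
```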

Now, both the training data and the configuration of the job that created it are persisted for reproducibility and later retrieval, and we can always come back to this API and use it again to retrieve an updated training data set using updated feature values. So let’s look at this last step in more detail at the data level to see what’s going on with the time-travel component.

Solution | Built-in row-level time travel for accurate training data

A training data set needs to reflect feature values that represent what the model would have seen if it were making a prediction at a certain point of time in the past. We’ve defined a set of features that we wanna use in our training data set, but which historical values should we use for these features? For some problems, when we want a regularly scheduled prediction, something like a daily value is fine, in which case it’s pretty easy to generate a feature value every day. For other cases, we wanna generate a prediction at a specific time in the past, at an event like a transaction time, purchase time, or login time for a fraud prediction, or even a page load event for a recommendation. So we want our training data to represent the feature values that existed at each of those page load events in the past. This is typically a huge hassle and can involve complex backfills, hard-to-catch mistakes, and complicated joins. One of the magical parts of Tecton’s API is that it just asks you to provide a list of historical events. This is super simple; it’s literally just a data frame of user IDs and event times. With that, Tecton expands the data frame and fills in the table with the right historical values for every feature. It does this using time travel, and this structurally prevents information leakage in your training data set. This is super important and hard to do with event-based data.
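For a sense of what that time-travel machinery saves you from, here is roughly what a point-in-time correct join looks like when done by hand in PySpark for a single feature, assuming hypothetical `page_load_events` and `click_count_7d_log` tables and an existing `spark` session:

```python
# Manual point-in-time join for ONE feature: for each event, keep only the most
# recent feature value computed BEFORE the event, so nothing leaks from the future.
from pyspark.sql import Window, functions as F

e = spark.table("page_load_events").alias("e")      # user_id, event_timestamp, label
f = spark.table("click_count_7d_log").alias("f")    # user_id, feature_timestamp, click_count_7d

joined = e.join(
    f,
    on=[F.col("e.user_id") == F.col("f.user_id"),
        F.col("f.feature_timestamp") <= F.col("e.event_timestamp")],  # no future values
    how="left",
)

latest = Window.partitionBy(
    F.col("e.user_id"), F.col("e.event_timestamp")
).orderBy(F.col("f.feature_timestamp").desc_nulls_last())

training_rows = (
    joined
    .withColumn("rank", F.row_number().over(latest))
    .where(F.col("rank") == 1)                      # most recent value per event
    .select("e.*", "f.click_count_7d")
)
```

Doing this by hand for every feature, each with its own backfill, is exactly the hassle the single-call API removes.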

Behind the scenes Tecton’s training data generation reuses precomputed feature data as much as possible, and it uses your Spark cluster to generate these values.

Now we come to the final set of challenges that we’ll talk about: the challenges encountered when we’re moving to production, and especially when we’re moving beyond batch to, say, real-time inference. Putting models in production is super hard, but it’s even harder if we wanna make real-time predictions, or even worse, use real-time or streaming data. Challenges include getting engineering to rewrite our code in a production environment, training/serving skew, provisioning the right serving infrastructure, making the right trade-offs between cost and data freshness, and quality monitoring.

So let’s look at an example. When we’re building a model, we typically start by just building a training pipeline.

Challenge 2: Today, moving to production requires reimplementation

Now say we wanna move this model to production,

for example, to make live recommendations or detect fraud in real time. We need to connect to the production data sources, run the feature pipelines in the production environment, and deliver the feature values to the model for prediction.

And again, these use cases often require that this all runs at low latency, meaning less than one hundred milliseconds, so we can no longer do something like spin up a Spark job to generate these features at prediction time. A common but dangerous solution is to re-implement this pipeline in the production environment.

If you’ve ever had an engineering team re-implement your Python code in Java, this is what we’re talking about. Besides taking time and being costly to implement, these re-implementations are error prone; they can be close, but they’re very rarely exactly the same. When feature pipeline implementations vary slightly, they can introduce train/serve skew.

Differences in the data can make models behave erratically and destroy model performance. The way to solve this is to reduce duplicate implementations as much as possible.
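As a contrived illustration of how easily that skew creeps in, here are two “equivalent” re-implementations of the same feature, a sketch rather than code from any real pipeline, that quietly disagree on boundary handling:

```python
# Two "equivalent" implementations of a seven-day click count that silently
# disagree: one uses calendar days with an inclusive boundary, the other a
# rolling 168-hour window with an exclusive boundary. A contrived sketch.
from datetime import datetime, timedelta, timezone

def click_count_7d_training(click_times, as_of):
    # Offline version: seven calendar days back, truncated to midnight, inclusive.
    cutoff = (as_of - timedelta(days=7)).replace(hour=0, minute=0, second=0, microsecond=0)
    return sum(1 for t in click_times if t >= cutoff)

def click_count_7d_serving(click_times, as_of):
    # Online re-implementation: rolling 168 hours, exclusive boundary.
    cutoff = as_of - timedelta(hours=168)
    return sum(1 for t in click_times if t > cutoff)

now = datetime(2020, 6, 24, 12, 0, tzinfo=timezone.utc)
clicks = [now - timedelta(days=7, hours=6), now - timedelta(days=2)]

# Same user, same raw events, different feature values fed to the model.
print(click_count_7d_training(clicks, now))  # -> 2
print(click_count_7d_serving(clicks, now))   # -> 1
```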

Solution | Unified train/predict pipelines ensure online/offline consistency

Tecton solves this by using its pipelines to generate feature values both offline and online for predictions. It uses a single implementation for every feature pipeline. When features are calculated in batch, for example on heavy pipelines like a Spark job, it automates the regular execution of the feature pipelines and the loading of those values into a serving layer for predictions. Real-time computed features, though, have different data sources in production versus the dev environment.

Solution | Tecton delivers those features “online” for real-time predictions

Tecton provides a unifying abstraction over these multiple data sources to allow the data scientist to develop against a single virtual data source, and Tecton automatically handles using the right data source depending on the execution context, meaning depending on whether this is a training or prediction job. Depending on different data sources is not desirable, because it introduces a new opportunity for differences across these pipelines, but it is sometimes unavoidable. Tecton monitors the data distributions across the training and prediction pipelines to generate alerts if data starts to diverge.
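One way to picture that unifying abstraction is a single logical source backed by both a batch table and a stream; the sketch below uses entirely hypothetical class and parameter names, not Tecton’s actual configuration objects:

```python
# Hypothetical sketch of a unified data source: one logical source backed by a
# batch table (training and backfills) and a stream (fresh values for serving).
# All names here are illustrative, not Tecton's actual API.
from feature_platform import BatchSource, StreamSource, UnifiedSource  # hypothetical

click_events = UnifiedSource(
    name="click_events",
    batch_source=BatchSource(table="warehouse.click_events"),              # historical truth
    stream_source=StreamSource(topic="prod.click-events", format="json"),  # live events
)

# A feature written against `click_events` is defined once; the platform reads
# the batch table when generating training data and the stream when computing
# fresh values for online serving.
```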

All features defined in Tecton are productionized like this out of the box. This means that with Tecton, it’s now trivial to go from a couple of lines of Spark code to a production-hardened API for retrieving features in real time at scale, in just a few minutes. This allows data scientists to put their features in production without depending on anyone else for re-implementation or deployment. Data scientists can now truly own their work in production, because they can now safely deploy to production on their own.
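As a hedged sketch of what that online retrieval path could look like from a production service, with a hypothetical endpoint, payload shape, and credentials:

```python
# Hypothetical sketch of the online path: fetch a single feature vector over
# HTTP at prediction time. Endpoint, payload shape, and auth are illustrative.
import os
import requests

resp = requests.post(
    "https://features.example.com/api/v1/get-features",   # hypothetical endpoint
    headers={"Authorization": f"Bearer {os.environ['FEATURE_API_KEY']}"},
    json={
        "feature_service": "ctr_model_features",
        "join_keys": {"user_id": "user_42"},               # which row to fetch
    },
    timeout=0.1,  # the whole call has to fit inside a <100 ms latency budget
)
resp.raise_for_status()

feature_vector = resp.json()  # a single row of features for model.predict(...)
```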

And so the training pipelines will serve millions of rows and access training data through a DataFrame API, while at prediction time we’re pulling a single row of data, we’re doing it in milliseconds, and that’s all done through a REST API. So bringing this all together once again: Tecton manages the full feature lifecycle for machine learning. It allows data scientists to define and contribute great features to a common repository, and to reuse and assemble those features into high-quality training data sets to build great models. It calculates fresh values of those features and serves the updated values for predictions in the production environment. One of the companies that has been using Tecton extensively over the last few months is Atlassian, and I wanna hand it off to Geoff at Atlassian to let him talk about how Tecton has enabled machine learning to scale at Atlassian. – Okay, thanks Mike.

Example: Automated Content Categorization in Jira

So I’m from Atlassian. We make products such as JIRA, Confluence, and Trello, and these are used by thousands of teams worldwide. Our mission at Atlassian is to unleash the potential of every team, and part of that capability is using the wealth of knowledge that we have about our users to build smarter and more delightful experiences. The screenshot we have on the screen now is of our issue tracking system called JIRA, and there’s a bunch of things on the screen that you could imagine we’d like to predict. For example, what’s highlighted are the labels of the ticket; these might correspond to what subject matter this person is talking about, or what this ticket is about. We can also see the assignee: who would you like to assign to this ticket? Maybe you’d like to assign someone that you recently or frequently used, or maybe you’d like to assign someone that regularly works on these sorts of tickets. So there are plenty of things here that we can think about in the machine learning space. Now, if we think about the scale aspect of Atlassian, we’re a very big company with hundreds of thousands of customers, so each single day we collect billions of events from our products. Every time any action is performed in any one of our products, an event is fired, and we record that information.

Now, even if we develop simple models, for any given model we can have hundreds of millions of key combinations, different combinations for each feature that are stored and that we need to have accessible at any given time in case we need to make a prediction. But we also need to update each one of those hundred million combinations anytime a new event gets fired, because we always want the information to be up to date, which is a huge challenge. And then each day, across all the experiences that we already power, we’re generating a billion individual predictions. So we need features that are updated in real time to make a billion predictions and power many, many experiences each day. We’re talking about a huge scale challenge for us, even for simple models in the machine learning space. So I’d like to tell you about what things looked like for us before we used Tecton, and then what things look like afterward. When I joined the team that I’m currently in, building these in-product experiences with machine learning, there was a lot of confusion around what features we were using. I couldn’t find central definitions; I couldn’t find sources of truth for particular features. Of course, now in the Tecton world, as you’ve heard, each feature is individually and independently defined, and for us we actually link these to a git repository. So we have really nice central definitions per feature, which removes all ambiguity from the situation. We used to have really long and complicated SQL and PySpark workflows offline, just to generate all these different features to join onto our model, and we needed to be careful not to introduce leakage into the model. Whereas now, in the Tecton world, it’s literally one line of code to join all the features onto a given training set, and they handle all the time travel and prevent the leakage internally, so the data scientist no longer has to worry about that. It used to take us months of interfacing with experienced engineers to implement a given feature in a streaming fashion. That might not mean much if you haven’t had a lot of experience with streaming features, but doing things in streaming is a lot different to just calculating things in batch. Now, of course, data scientists can directly generate features and put them into production with no help from engineering at all.

One of the most important things: before, even if we got our long and complicated workflows working and we did manage to put everything into production, there was still no guarantee that what we’d done offline was gonna directly match what we were serving online. It is an extremely hard thing to get right at scale. Of course, one of the whole premises of Tecton is that they handle this for you: whatever you do offline is guaranteed to happen online. So that’s one of the really nice things.

In terms of actual results: before, we had an in-house feature store, and it was pretty good, you know, this is a really hard problem to solve. We were about 95 to 99% accurate. By accurate I mean: what proportion of the hundreds of millions of feature keys that we were storing were exactly right, if we were to go and compute what those values should be offline? So 95 to 99% is pretty good, but we had two to three full-time people looking after that service. Whereas now, after a big pilot process and validation exercise, we’ve got Tecton running at 99.9% accuracy across all of our features, and only minimal work from us is required to interface with it and look after it. So this is a really phenomenal improvement.

If we think about actual impact to our customers: well, we started off with a pilot process, onboarding Tecton with one model, and within a couple of months we had three models running in production, and that’s the current state. So whenever you mention someone on Confluence, whenever you mention someone in JIRA, or whenever you assign someone a JIRA ticket, those models are powered by the Tecton platform. We also have two other models, the label and component predictions in JIRA issues, which are being onboarded as we speak.

As a net result, just moving our existing models onto the Tecton platform, each model we’ve moved over has seen an improvement in accuracy, purely due to more accurately calculated features. And if you look at all the accuracy improvements over everything that we’ve migrated, we’re talking about over 200,000 improvements for customers just as a result of using the Tecton feature store. So this is an absolutely monumental improvement for us, and we’re really happy with that. I’d really like to recommend that anyone trial Tecton, or talk to the Tecton folks, anytime you’re doing streaming or at-scale machine learning, because that’s really where Tecton comes into its own. So thank you very much, and I’ll hand it over to Mike. – [Mike] Thanks Geoff, and thank you for watching.


 
About Mike Del Balso

Tecton.ai 

Mike Del Balso is the co-founder of Tecton.ai, where he is focused on building next-generation data infrastructure for Operational ML. Before Tecton.ai, Mike was the PM lead for the Uber Michelangelo ML platform. He was also a product manager at Google where he managed the core ML systems that power Google's Search Ads business. Previous to that, he worked on Google Maps. He holds a BSc in Electrical and Computer Engineering summa cum laude from the University of Toronto.

About Geoff Sims

Atlassian

Geoff is a Principal Data Scientist at Atlassian, the software company behind Jira, Confluence & Trello. He works with the product teams and focuses on delivering smarter in-product experiences and recommendations to our millions of active users by using machine learning at scale. Prior to this, he was in the Customer Support & Success division, leveraging a range of NLP techniques to automate and scale the support function.

Prior to Atlassian, Geoff has applied data science methodologies across the retail, banking, media, and renewable energy industries. He began his foray into data science as a research astrophysicist, where he studied astronomy from the coldest & driest location on Earth: Antarctica.