AI @ Scale: Industrialized ML with Databricks

May 27, 2021 12:10 PM (PT)

Delivering AI solutions at an enterprise scale involves bringing together a variety of perspectives to augment an organization’s data science capabilities. Accenture describes how to industrialize your ML applications to accelerate model delivery and optimize the data science workflow while maintaining the highest standards of model governance. Participants will take away an understanding of how ML Engineering can help you industrialize your ML pipelines and what is involved in the management of the end-to-end ML pipeline (not just model management). We will demonstrate how the Databricks Lakehouse platform is an ideal tool for delivering ML pipelines at scale.

In this session watch:
Nathan Buesgens, Data Architecture and ML Engineering Consultant, Accenture
Atish Ray, Managing Director, Accenture Applied Intelligence, Accenture

 

Transcript

Nate Buesgens: Hello, I’m Nate Buesgens and I’m a data architecture consultant and an ML engineer.

Bryan Christian: Hi, I’m Bryan Christian. I’m a data scientist. I lead the data science function for mission data at Navy Federal Credit Union.

Nate Buesgens: And today, we’re going to be talking about feature stores: what’s involved in the scope of a feature store, some of the core use cases that serve both the data scientist and the ML engineer, and which of those use cases allow us to achieve 80% of the value with 20% of the effort.
We’re also going to compare this to some of the ways that we’ve conventionally managed data, especially in a data warehouse sort of context. And we’re going to look at how those differences drive our design in terms of the way that the scientist will access the data and also how we physically represent that data in a Delta Lake.

Bryan Christian: All right. So I’m going to start us off with: what is a feature store? The feature store is what’s going to enable us to achieve data science at scale, and we do this through MLOps, or ML operations. There are a few examples I would like to give. One is that you see roughly a 15 times acceleration in model delivery with MLOps, due both to the automation and to the governance you can automate around these models, how you’re deploying them, and the metadata you’re collecting.
A huge bottleneck, as many of us know, in data science is data wrangling: actually getting your predictors into the state that you are happy with, such that they will form accurate predictions for whatever you’re actually modeling. That’s a huge component of the work. And so, what the feature store enables is roughly a 75% reduction, more or less what we’ve observed, in the feature engineering time for data scientists.
As well, the way the feature store works is it gives this end-to-end value delivery. It’s something the engineers can use to put machine learning models into production, but it’s also exposed to the data scientists to accelerate the building of the models themselves. This reduces the time to value and increases concurrency, and as well, you have scalable infrastructure, because now, as the data scientists are building their models, they’re using the exact same productionalized elements and features that the engineer will be using.
And the overall goal here is to really avoid this proof of concept factory, where you might have a proof of concept that never really gets to production. And you just have proof of concept, after proof of concept, after proof of concept. And so by enabling the data scientists in this fashion, we’re able to then not only have models that make it to production, we’re able to do so much more quickly.

Nate Buesgens: Yeah. And I would just add to that what Bryan was saying about the proof of concept factory, that’s really at the core of what we’re doing with ML operations. We’re trying to take those lessons that we learn from each deployment and codify those lessons and bake them into our infrastructure, and the feature store is a great example of one of the ways we can do that.

Bryan Christian: Yeah. Thanks Nate. And so, let’s actually dig under the hood and talk about what that bottleneck looks like. So the feature store serves as the consumption layer for ML applications. So, it provides acceleration through these pre-hardened features that reduce that data wrangling time, and also this governance through a common consumption pattern that ensures nothing is really lost in translation.
So let’s look at a common design pattern that you might see before a feature store. Let’s say you have your curated data available in the data lake, and you go to do some feature engineering. Say I’m going to build a model in financial services where I need to know someone’s average checking account balance over the last 30 days. I’m going to feed that into my model and that’s going to serve up some predictions. Now let’s say I have another data scientist who comes along and they also need to use that exact same feature.
And they need to engineer that feature and wrangle it on their own, that average 30-day checking balance, and then build a model. And then yet again, another data scientist comes along, needs that same feature, and has to build their own model. So here you’re seeing this bottleneck, because you have the same feature that these data scientists are all after to feed into their models so they can get their predictions.
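To make that duplicated wrangling concrete, here is a minimal PySpark sketch of the kind of feature each scientist would otherwise rebuild on their own. The table and column names (curated.checking_transactions, customer_id, balance, txn_date) are hypothetical placeholders, not an actual schema from the talk.

```python
from pyspark.sql import functions as F

# `spark` is the SparkSession already provided in a Databricks notebook.
as_of = "2021-05-01"  # the point in time the prediction is made as of

# Average checking balance over the trailing 30 days, per customer.
avg_checking_30d = (
    spark.table("curated.checking_transactions")   # hypothetical curated source
    .where(
        (F.col("txn_date") > F.date_sub(F.lit(as_of), 30))
        & (F.col("txn_date") <= F.lit(as_of))
    )
    .groupBy("customer_id")
    .agg(F.avg("balance").alias("avg_checking_balance_30d"))
)
```

Without a feature store, each scientist re-derives and re-runs something like this on their own; with one, it is computed once, governed centrally, and reused.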
And the governance aspect is multi-pronged, because it’s not only that there’s a single definition; it reduces the manual coding and the sort of human error that can occur as you’re coding the individual features, which would otherwise need to be cleaned up in production, and it makes the definition consistent for many of these features. And you can reduce your bottleneck by having the feature store serve as this feature engineering hub, which then passes along that capability to the data scientists, who build their models from a common store of features. Anything else to add, Nate?

Nate Buesgens: No, it looks great, Bryan.

Bryan Christian: All right. So let’s talk about some data science use cases. The feature store itself is built on point in time and the level of the prediction in terms of granularity. So that means if I’m going to make a prediction about a person, I’m going to have the granularity of a person, and then a point in time. If I’m making the prediction about an account, is this account going to overdraw or something along those lines, then the level of prediction is the account and, again, that point in time.
And that point in time aspect is extremely important; I’ll talk more about that in a second. As well, you have correct, consistently applied joins across multiple Delta files without the loss of processing speed. So rather than having data scientists cobble together multiple Delta files and do all those joins on their own, and then every time you run each of these features in production having to run all those joins over and over and over again, you have a single place where these joins happen.
We’re going to talk a little bit more in a few slides about the SDK that we built, which dynamically allows these joins to happen, and that further increases the speed of delivery, because you’re not creating all of your joins if they’re not necessary. Finally, you have aggregations, window functions, and transformations of the data. This part is really where you’re forming the actual features.
And a key element of this granularity, I mentioned the point in time before, and this is where it really differs from a snapshot, is that these window functions allow future and backwards visibility as of the point in time. So let’s take the example below, where let’s say we have a customer ID, and that’s the level of the prediction, and then an as-of date, let’s say May 1st. And the first two features are, say, the same feature, but one is measuring what happened zero to 30 days ago and the other 31 to 60 days ago.
And you have some continuous variable output there. Now those are backwards-looking window functions. You can do that with the most recent data, but with the point in time, what you can also do is look at the future 30 days: what happens in those next 30 days? And oftentimes those future-facing windows will actually become the target of your predictive model when you’re doing time series analysis.
Like, who’s likely to overdraw in the next 30 days? That would be an example where, in the feature store, you would have predictor inputs, which would be analogous to what you see on the left, and the target would be essentially the one on the right, which is future-facing. Now, we have some metadata that’s actually embedded in the code itself, which the SDK uses to control which features we’re actually going to be passing into our models. And this really prevents the feature leakage that would otherwise be a huge concern.
Especially when you get up to thousands of features, knowing which features might be future-facing or backwards-facing matters, but this metadata will also allow you to pull out data you might not want, such as data from a third party or PII data. And so this metadata, which we’re going to talk about in a little bit, shows how we’re solving not just this feature leakage problem, which you can imagine here, but many other metadata-related issues as well.
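To illustrate those point-in-time windows, here is a hedged sketch of building two backwards-looking predictor windows and one future-facing window (the target) keyed by customer ID and as-of date. The source table, column names, and aggregation are assumptions for the example, not the actual pipeline.

```python
from pyspark.sql import functions as F

as_of = "2021-05-01"
# Hypothetical event-level source: (customer_id, event_date, amount).
events = spark.table("curated.account_events")

def window_agg(start_offset_days, end_offset_days, name):
    # Aggregate over the half-open interval (as_of + start, as_of + end];
    # negative offsets look backwards, positive offsets look into the future.
    lo = F.date_add(F.lit(as_of), start_offset_days)
    hi = F.date_add(F.lit(as_of), end_offset_days)
    return (
        events
        .where((F.col("event_date") > lo) & (F.col("event_date") <= hi))
        .groupBy("customer_id")
        .agg(F.sum("amount").alias(name))
    )

# Backwards-looking predictors: 0-30 days ago and 31-60 days ago.
predictors = window_agg(-30, 0, "spend_0_to_30d_ago").join(
    window_agg(-60, -30, "spend_31_to_60d_ago"), "customer_id", "outer"
)

# Future-facing window: this becomes the target, and the metadata should flag
# it as future-facing so it is never passed in as a predictor (feature leakage).
target = window_agg(0, 30, "overdraft_amount_next_30d")

feature_rows = (
    predictors.join(target, "customer_id", "outer")
    .withColumn("as_of_date", F.lit(as_of))
)
```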

Nate Buesgens: Yes. So Bryan’s gone over some of the core functionality of a feature store, the sort of functionality that’s going to get you 80% of the value that we’re looking for from a feature store with the least amount of effort. But we also didn’t want to overlook some really important additional functionality, which often gets baked into a feature store but may address slightly more niche use cases.
And one of the most common examples of that would be online feature serving. So one of the things we do with a feature store is try to maintain consistency between the data we’re using in our training environment and the data we’re using in our prediction environment. And that can be especially complicated if, when we’re making predictions, we have requirements around ultra low latency, ultra timely features, and if we have a use case where we are making point reads, which would differentiate it from a streaming solution, for example.
So we wanted to highlight some of the types of machine learning use cases that you can achieve with this sort of like 80-20 solution that don’t necessarily have that online feature serving requirement. And a lot of those fall into this category of machine learning applications that you could describe as human plus AI. So for example, if you are optimizing content on a digital media website, some of those use cases, some of those machine learning applications are going to be online use cases.
You’re going to have a machine learning application acting as an agent. The AI is acting as an agent in order to make decisions, for example, about what thumbnail to put in front of a video. But where you might also find a lot of low-hanging fruit for machine learning would be in applications that help curate data for an editor of that media website, where the editor is still acting as the decision-making agent, but you’re helping them curate and providing that editor with more telemetry. That would be a human plus AI solution.
And those types of solutions will tend not to have the same sort of online requirements. They’re an often overlooked application of machine learning where you can get a lot of low-hanging fruit. Another thing that often gets roped into feature store implementations would be optimization of the workflow for developing ETL pipelines. There’s a lot of interesting work happening there, but we just find that where we get the most value is optimizing the way we access that data versus optimizing those ETL pipelines.
So now we want to talk about how this can be different from some of the ways we’ve conventionally managed data through data warehouses and through dimensional modeling or star schemas, because there are some similarities. A feature store is similar to a data warehouse in that it’s sort of where we keep our golden aggregates of curated data. The data tends to be highly structured and has a lot of very similar non-functional requirements around governance standards, metadata management, or discovery.
It’s very typical to talk about catalogs of features that we can select from in a similar way that we would talk about catalogs of data that we might want to explore for a BI use case. But there are some fundamental differences in the way that we access the data between a BI use case and a machine learning use case, and those access patterns drive different data models, those different data models potentially drive different technology stacks.
And really what it comes down to is that when we’re doing supervised learning, it creates these really nuanced and complex requirements for point in time accuracy of our data. So because that’s such a key requirement that drives so much of our design, we’re going to drill into that and talk about some of the ways that inconsistency can sneak in if we’re not careful about maintaining this point in time accuracy.
And the first way that inconsistency can sneak in is through window functions. Window functions are certainly not unique to feature stores; they’re part of the standard SQL syntax, but they are one of the more complex parts of standard SQL syntax. And we have applications within data science which sort of exacerbate that complexity. So for example, it’s not uncommon for a scientist to say, “I have 100 base features, and each of those 100 base features I want to window in 10 different ways.
I want to see that feature aggregated over the last 30 days or the last 60 days, or I want to aggregate weekly or monthly,” or I may want to look ahead and create a target variable for 30 days from now. Those all add complexity to the implementation of window functions, which makes them a common area where coding errors can sneak in or something can get lost in the translation from development to production.
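As a hedged illustration of that combinatorial explosion, the sketch below generates the windowed variants programmatically instead of hand-coding each one. The base measures, lookback set, and source table are assumptions for the example.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

base_features = ["checking_balance", "card_spend"]   # hypothetical base measures
lookback_days = [7, 30, 60, 90]                      # the window variants

DAY_SECONDS = 24 * 60 * 60
agg_cols = []
for col_name in base_features:
    for days in lookback_days:
        # Range-based frame over the trailing `days` days, per customer.
        w = (
            Window.partitionBy("customer_id")
            .orderBy(F.col("event_date").cast("timestamp").cast("long"))
            .rangeBetween(-days * DAY_SECONDS, 0)
        )
        agg_cols.append(F.avg(col_name).over(w).alias(f"avg_{col_name}_{days}d"))

# Hypothetical one-row-per-customer-per-day source table.
daily = spark.table("curated.daily_balances")
windowed = daily.select("customer_id", "event_date", *agg_cols)
```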
And data warehouses are certainly capable of serving the outputs of our window functions, but one area where a dimensional model can oftentimes fail us when we’re doing feature management is when we’re trying to address this problem of feature leakage. So there’s precedent; there are examples of fact tables in our dimensional models which are periodic. But where we can run into trouble is if we are joining those periodic fact tables to dimension tables which are not periodic.
That can be an example of where data from the point in time of our target variable can sneak into the data that we’re using for training, and that can cause feature leakage. It’s one example of how feature leakage can occur if you’re not careful about how you’re building those ETL pipelines. And solving those challenges is especially problematic when you’re at the same time trying to address the challenge of watermarking.
So if you go to the next slide, we’ll share a visualization, this is from the Spark documentation, that highlights some of the challenges we run into when there can be broad inconsistency between the event time of our features and the processing time when we process those events. Some of the data that we’re creating features out of might be coming in a streaming context and updated very rapidly, while other data may be coming in weekly or monthly batches.
And rationalizing all those different data sources, while at the same time maintaining consistency in our window functions and ensuring there’s no feature leakage, means addressing all of these problems at the same time, which creates a lot of nuanced complexity. So what we’d like is an interface for the scientist, an access layer or a conceptual model of the data, that lets them sort of hide this complexity behind that interface. We want to encapsulate that complexity. So first, we would like the model for accessing the data to be based on the entity that they’re trying to model.
So if a scientist is trying to model members at a bank or at a credit union, then that would be one data set. Whereas if they’re trying to model accounts at a credit union, that would be a different data set. And then rather than asking the scientists to do all this complicated and expensive point in time processing to join all of our different data sets, we’d like that to be part of the granularity of our conceptual model of the data.
So we’d like a scientist to be able to tell us, “Here’s what I’m trying to model, and here’s a day or an hour as of which I would like the features to create that model.” And what we’re going to show you today is an implementation where we have a very discrete granularity. We sort of force the scientist to select an hour or a day, although there are options which would allow you to select on a more continuous time scale. And that might be important, again, if you have a focus on online feature serving.
In terms of the columns of our data model, that’s going to include the features and we serve the features in an unvectorized way. But again, that’s an opportunity for optimization if you really are trying to achieve very low latency and optimize the performance of your feature serving. Our feature store also includes our target variables and our predictions simply because they exist at the same granularity as our features and we have many of the same governance challenges around managing that data as we do our feature data.

Bryan Christian: All right. So now let’s talk about this SDK that we’ve built to enable the scientists, and the ML engineers as well, to use the feature store. This SDK indexes the available features and builds the feature store dynamically as needed, and it’s a production-grade feature selection tool. That means the data scientists can use it, and then, if they are building their notebooks and hardening them to a handoff standard, the ML engineers can use that same code, which relies on the SDK, to deploy the actual models into production.
So a few key elements. There’s no need to rebuild the whole feature store when new features are added; certain sets of features can be rebuilt as needed, so the entire feature store doesn’t have to be down while you’re rebuilding individual features. The part that I’m particularly excited about as a data scientist is the keyword searching, which really enables us to look for features using human logic.
I’m going to walk through some examples of that coming up. And then the tuning can be specific to each set of features, allowing for optimal feature creation. So the core functionality that we’re going to talk about are these commands: find, select, and select by. I’ll walk through each one of them individually, with some examples. Let’s look at find.
The purpose of find is to search through all the columns of the metadata within the feature store without actually having to execute the joins and pull the data through, which would cause so much processing. With this, you can explore the features without having to go through a data frame of thousands of features looking only at the individual feature names, such as in the schema. Here you can actually look at both the feature names as well as the metadata associated with them.
So this particular command has a number of different arguments that you can pass to it. You can use things like regular expressions, you can include or exclude different types of keywords, and you can set case sensitivity to true or false. So here’s an example of setting some keywords: you can call the feature store with an fs command and then use .find, and then let’s use a regular expression where we’re going to look for the keywords ASDF and QWERTY.
You can imagine these could be anything that you might associate with any of the features based on your particular business case or your business needs. For me at Navy Federal, most of those are financially related, but for the purposes here, you can see it generalizes pretty broadly. So it’s going to return a list, it’ll say something along the lines of, “your search returned 20 results,” and it’ll give you the feature name as well as the metadata associated with it.
So notice that in the first one listed, feature name one, neither ASDF nor QWERTY actually appears in the feature name itself. That’s different from doing a schema search; the search is actually going into the metadata. You see the comment flag, “if ASDF is greater than 0.3 at any point in time.” So what’s causing that result to return is that keyword ASDF popping up in the comment.
Whereas, if you look at that second entry, you see the feature named QWERTY one; the keyword is now in the feature name but does not actually appear in the metadata itself. So this allows you to dynamically explore the data before you even start pulling the data. And as a data scientist, the value is that I can have very deep visibility into what’s available across thousands of features without having to scroll manually through everything that’s generated in the schema output itself, which can become very tedious.
So if I want to know something about credit card balance, I could type in credit card and balance and, with the appropriate combination of how I want to look at that, I could pull that in. If I only want to look at balance, I could then look generically at balances from loans, credit cards, checking accounts, et cetera. So you can see the power and the versatility of this, just to understand what’s in the feature store.
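The SDK internals aren’t shown in the talk, but as a rough sketch of how a find-style search over feature names and metadata comments could work without reading any data, one option is to scan column comments on the underlying Delta tables. The function and table name below are illustrative assumptions, not the actual SDK.

```python
import re

def find(spark, keywords, tables, case_sensitive=False):
    """Search feature names and their column comments for any of the keywords."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile("|".join(keywords), flags)
    hits = []
    for table in tables:
        # DESCRIBE TABLE returns (col_name, data_type, comment) without scanning data.
        for row in spark.sql(f"DESCRIBE TABLE {table}").collect():
            text = f"{row.col_name} {row.comment or ''}"
            if pattern.search(text):
                hits.append((table, row.col_name, row.comment))
    return hits

# Mirrors the ASDF/QWERTY example: a hit on either the comment or the feature name.
results = find(spark, ["ASDF", "QWERTY"], ["feature_store.accounts_daily"])
```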
But more than just wanting to know what’s in there, you eventually want to use it. So we have the select function. The idea of the select function is that it’s going to return a data frame at the granularity of whichever feature store you’re calling. So in this case, a customer ID and an as-of date will always be present, and then whatever selected features. You need to provide an argument of the date and then a list of the features.
So you can use a date called latest, which is always just the most recent data available, or you can use some operators, greater than, less than, or equal to, and provide a specific date such as May 1st. And so, for the syntax, you can see an example below, where our data frame name equals fs.select. Here I’m giving latest. You can type in comments as well.
And I’m going to list off these same three columns that I had listed earlier when describing the feature store initially. And if you display that, you’re actually going to get a data frame stored in memory, and that’s going to be useful for consistently calling the same features. This is especially helpful for data scientists when you are hardening your data science output and you want to provide a consistent list of features that comes from the feature store every single time.
And that will be executable in the production grade notebooks. The reason is that if you were using keywords or something like that, then as you add more features, you could actually have additional columns inserted into your model. Now, that’s probably what you want for exploration. And so, we have this third method, select by, which is exactly what I was describing: probably not what you want to include in the notebook and code you hand off to an engineer to put into production.
But as you’re exploring what’s actually predictive, this helps you pull in that vast array of features based on the same sort of find functionality we talked about a few slides back, now combined with the date argument. So I’m going to generate a data frame as of this date, and I’m going to use these arguments to explore the metadata and return specific columns. Now, this will grow and shrink based on what’s actually in the feature store.
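As a rough sketch, here is what the two calls just described might look like in a notebook. Only the general shape of the SDK is described in the talk, so the exact argument names are illustrative.

```python
# select: an explicit, stable feature list -- what you harden into the handoff
# notebook so production pulls exactly the same columns every run.
df = fs.select(
    "latest",                           # or an operator plus a date, e.g. ">= 2021-05-01"
    ["feature_name_one", "QWERTY_one", "feature_name_two"],
)

# select by: keyword- and metadata-driven -- useful for exploration, since the
# returned columns grow and shrink as features are added to the store.
exploration_df = fs.select_by("latest", keywords=["credit card", "balance"])
```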
So this is more for exploration, but it’s a consistent way by which you can explore as you’re finding out what is actually predictive, because you probably don’t want to throw the entire feature store at a model, because of multi-collinearity and all sorts of other processing problems, not to mention data science concerns. But you can have a smaller consideration set as you’re beginning to tackle these feature selection challenges, which will happen in data science. Nate, anything to add on the SDK?

Nate Buesgens: No, it looks great.

Bryan Christian: All right. Off to you.

Nate Buesgens: Yeah. So now we’re going to talk about how we’ve implemented that in a Delta Lake. It’s similar to the way we might build an ETL pipeline for dimensional modeling, in that this is a golden set of data that we’re keeping in our Delta Lake. And then, depending on your niche requirements, you might end up mirroring that, like we might end up mirroring our dimensional model to a data warehouse if we had a lot of high concurrency requirements.
If you have a lot of low latency requirements, you might find yourself mirroring your feature store to a low latency memory cache. The next slide will show how the SDK overlays on that to provide a consistent view if you do have your data mirrored between two technology stacks. It also gives us the opportunity to separate this logical model that we’ve exposed to Bryan from the way the data is physically stored in the Delta Lake, and it enables Bryan to do these sort of metadata-focused queries that aren’t as natively part of the Spark SDK.
So now we’re going to talk about how those tables are represented in the Delta Lake. In an ideal world, we would let Bryan choose any arbitrary time on a continuous timescale to select his features by, and again, going back to online feature serving, that may be a requirement for some of your use cases. Where we’ve had the opportunity to simplify this a bit, we just go ahead and pre-aggregate the data to some pre-agreed-upon time buckets, which really simplifies how we do point in time joins, even if it’s not as flexible and the data’s not quite as timely.
And because now all of our data is at the same granularity, we would have the option to keep it all in one very wide table. Delta certainly gives us the capability to do that, but still we find that splitting it up roughly by data source gives us the opportunity to simplify schema migration. Again, there is certainly functionality within Delta to do schema migration, but this simplifies things for us a little bit. It makes things easier on the query planner, it makes it easier for our engineers to optimize our ETL jobs, and it also makes it easier for us to schedule those ETL jobs.
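A minimal sketch of that layout, with hypothetical table names: one Delta table per data source, all pre-aggregated to the same daily as-of granularity, so the point-in-time join collapses into a simple equi-join.

```python
from pyspark.sql import functions as F

# One Delta table per data source, each keyed by (customer_id, as_of_date).
checking = spark.read.table("feature_store.checking_daily")
cards = spark.read.table("feature_store.cards_daily")

# Because every table shares the same daily granularity, the point-in-time
# join is just an equi-join on the keys.
features = checking.join(cards, ["customer_id", "as_of_date"], "outer")

# Serving a training set for a given as-of date is then a simple filter.
training = features.where(F.col("as_of_date") == "2021-05-01")
```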

Bryan Christian: Yeah. And just to add a little bit about those aggregations and the comparison of online versus less frequent feature serving: many of our use cases, especially in financial services, don’t need to be as rapid as real time. If I’m trying to predict how likely you are to start saving next month, that’s probably not going to change much from right now to 30 seconds from now or five minutes from now.
Those are much slower moving, probably won’t even change tomorrow that much. Week to week, sure. And so, that’s where we start talking about, well, is daily sufficient for most of our use cases that we are trying to drive, where we’re looking at predicting large scale shifts in behavioral changes. Of course, how you serve ultimately those insights back to a customer, that’s a consideration for channel integration which Nate was talking about.

Nate Buesgens: Yeah. So, today we talked about how feature stores serve these two classes of use cases for the data scientists, but also how they enable better governance and are a tool for the ML engineer. A lot of the complexity behind our design decisions for these feature stores comes down to making sure that we’re paying proper attention to point in time accuracy of our data. And we’ve highlighted how there are opportunities to achieve 80-20 solutions, mostly by really carefully considering your online feature serving requirements. So thank you very much for attending this presentation. Please stick around to leave feedback, and we’re happy to stick around to do some Q and A.

Nathan Buesgens

Nate is a Data Architecture and ML Engineering consultant at Accenture. He leads the design and technical delivery of complex ML applications. With his background in productionizing research applicati...

Atish Ray

As part of Accenture’s Applied Intelligence, Atish works with clients in multiple industries to architect and implement scalable data and machine learning applications that drive business transform...