A Practical Enterprise Feature Store on Delta Lake

May 27, 2021 05:00 PM (PT)


The feature store is a data architecture concept used to accelerate data science experimentation and harden production ML deployments. Nate Buesgens and Bryan Christian describe a practical approach to building a feature store on Delta Lake at a large financial organization. This implementation has reduced feature engineering “wrangling” time by 75% and increased the rate of production model delivery by 15x. The approach focuses on practicality: it is informed by innovative projects such as Feast, but the primary goal is an evolutionary extension of existing patterns that can be applied to any Delta Lake architecture.

 

Key Takeaways:

– Understand the key use cases that motivate the feature store from both a data science and engineering perspective. 

– Consider edge cases, such as “online” predictions, where there may be opportunities for simplification.

– Review a typical logical data model for a feature store and how that can be applied to your business domain. 

– Consider options for physical storage of the feature store in Delta Lake.

– Understand common access patterns including metadata-based feature discovery.

In this session watch:
Bryan Christian, Data Science Lead, Navy Federal Credit Union
Nathan Buesgens, Data Architecture and ML Engineering Consultant, Accenture

 

Transcript

Atish Ray: Hi everyone. This is Atish Ray from Accenture. Today we are here to talk about a very timely and interesting topic: scaling AI, and how we are able to industrialize and scale machine learning using a platform like Databricks. If you move on to the first slide, let’s look at what’s happening in the industries today and why scaling AI has become such an important topic for our clients and for the industries that we’re working with.
What we see, basically, is two trends converging. On one end, with IT driving a huge amount of cloud adoption and the migration of data and analytics workloads to the cloud, we are seeing obvious improvements in the operational efficiency and performance that cloud platforms enable.
At the same time, because of the capabilities that cloud infrastructures provide, there is a huge incentive for the business to unlock its data assets and rethink the way it runs, basically to drive a business transformation. Now, while most cloud migrations are driven by IT, the reinvention of the business is obviously driven by the business. What we are looking at is a transformed, data-driven enterprise, which is the conglomeration of both of these phenomena.
It sits on top of a cloud infrastructure. It has capabilities such as data capital management and capabilities around enabling consumption, both consumption of data and consumption of intelligence at scale. Hand in hand with those go three critical aspects: how you run your business in an adaptive way, how you incorporate an innovative culture across the organization, and how you monitor and stay on top of your ecosystems, so that your processes are always on, in step with the changes happening in the ecosystem, and you are able to adapt to new situations very quickly.
So far we have typically been focused on the capabilities to a large extent. As a result, the transformations that were expected to happen from the joining of these two trends have not delivered the extent of change we were hoping to achieve in the businesses. We’ll talk about that in the next few slides, but if you move on to the next one: what we are doing with our clients is what we generally call data-driven reinvention programs.
There are three aspects to those programs. From a business perspective, we are working with businesses on the initial part of the transformation, where we leverage the data and AI capabilities that we’re able to unleash on top of the cloud to fundamentally change some of the ways the business works. Secondly, we see a lot of growing ecosystem partnerships, where multiple businesses are joining hands to provide a more end-to-end experience. For example, McDonald’s working with Uber Eats to bring a great experience in terms of food at home.
Thirdly, building on the innovative culture that we talked about: what kind of new business models can be created so that businesses can leverage data as a competitive edge, bringing new services or new models that put them well ahead of the game in their own industries and ahead of their competition? Now, as we work on these data-driven reinvention areas, we are leveraging six capabilities to make that happen. Number one is identifying a set of what we call critical data elements that need to be well managed in order to deliver the business outcomes from data that are going to make the difference.
The point is, you don’t have to manage all of your data elements. You have to identify those critical data elements which are important for your business, that align to the strategies you want to pursue, and focus on those. Number two, which goes hand in hand, is building a significant amount of trust into those data elements for the purpose of usage and adoption. The ability to trace the data back to where it came from, the ability to govern the data well, and the ability to build continuous trust in that set of data for business use is another key area of our work.
Thirdly, and this is the key area for today’s call, we focus a lot on the platform and the architecture. That is an essential piece of being able to drive reinvention, and it’s where we want to focus on building scale, so that the business experience, the flexibility, the adaptiveness, all those features we talked about, are achieved through the platform that we incorporate, or will incorporate, in these businesses.
Fourth, the mindset of building and operating datasets as data products, and having an organization and operating model oriented around that, is critical for success in this space. It goes hand in hand with the platform and the data to drive the business transformations we are seeing in the market. The final two capabilities really concern the culture piece we talked about, which is an important capability behind bringing innovation and new business models: how you enable the business to adopt the new platform, the new datasets, and the new mindset.
That obviously involves managing the change properly, involves a lot of literacy programs, and involves making the data easily consumable and accessible by the business. All those factors are important. And finally, being able to continuously monitor and measure the outcomes you’re getting from these programs, to continuously show that you are delivering business outcomes, is a key factor for success. Now, as we run these data-driven reinvention programs, let’s focus on the data platform for today’s discussion, where we will discuss how we leverage solutions like Databricks and an industrialized AI/ML backbone to scale up the solutions on top of which we drive the reinvention we’re working on with our clients.
If you move to the next slide, this is a very high-level view of the platform capabilities, and in a few minutes Nate is going to dive deep into these areas and talk about what we’re observing. You can see the different capabilities of a full-blown data and AI platform that are important to establish. We work with our clients to define those reference architectures, the capabilities, how they sit together, and how they play together to deliver the brilliant basics, to foster the ecosystem partnerships needed to reinvent their businesses, and to build these new business models. At this point, I’m going to hand over to Nate to take you through what we are seeing in terms of maturity and scalability in the AI/ML space, and then to dig deeper into the industrialization pieces. Nate?

Nate Buesgens: Thanks, Atish. Hello, I’m Nate Buesgens, and I’m going to dive a little bit deeper into how Accenture can help your organization scale your AI applications, which is one of the pillars of the data-driven reinvention approach Atish just described. I’m going to talk about how we use Databricks to implement that approach. When we talk about industrialized ML and scaling AI, one of the reasons we focus on augmenting data science capabilities with this perspective on scale is that the enterprises that do this see a significantly higher success rate in their data science projects, and a significantly higher return on the investment in their data science organizations.
A lot of data science organizations today could be described as a proof-of-concept factory. In a proof of concept, we make some sort of compromise in order to quickly validate a hypothesis in production, and that can be really effective. The problem comes when we find ourselves making the same compromises over and over again. What we’d like to do instead is augment those data science perspectives with additional capabilities that let us bake the lessons learned from each model deployment into our infrastructure, so that ultimately we get to a digital platform mindset that enables an engine of innovation in our data science organizations.
Scale can mean a lot of things, so we’ll start by identifying some of the ways we can measure it and verify that we are accelerating your data science organization. One is by measuring the rate at which we’re deploying models: through automation and governance, we can deploy models faster, and we can also deploy more models concurrently. We can also measure how we’re optimizing the data science workflow; in particular, how we’re minimizing the amount of time the scientist spends data wrangling, which may be taking away from time spent on algorithmic design.
And then finally, we can measure whether we’re actually improving the functional performance of our models in production, through better tooling and standardization around how we evaluate our models and how we monitor them once they’re in production. So scale is about ensuring that we’re completing this end-to-end value delivery loop: making sure that the value created by our data science organization isn’t getting trapped in a PowerPoint, but is actually moving the needle in production. It also means optimizing for time to value and for the concurrency with which we can deliver value. And it’s about codifying our lessons learned into our infrastructure.
A great example of that is the Feature Store, and this is a conceptual overview of it. For more details on how we’re using Feature Stores at some of our clients, I hope you’ll see some of our other summit presentations. The Feature Store is a great example of how we can create acceleration for our data scientists while also giving ourselves an opportunity to layer on better governance of our machine learning applications. When a scientist wants to deploy a model, often they’ll start with a feature engineering process where they’re summarizing or vectorizing the data, turning it into something the model can understand. Then when we go to deploy our next model, or if we’re trying to deploy a model concurrently, we’ll find we’re going through a similar feature engineering process.
What we found is that the feature engineering from one model to the next can oftentimes be very redundant, and if we analyze the data science workflow for where the bottleneck is, this is one of the first areas that comes up. The other problem this redundancy causes is that it makes the pipeline difficult to govern. What we’d like to do instead is isolate those concerns into a production feature engineering process and store the results in a Feature Store, so that when a scientist goes to create a new model, on day one they have a pool of features to select from. And as an engineer, it gives me a single place to go to implement better controls, governance, and tooling.
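To make the pattern concrete, here is a minimal sketch of what publishing to and reading from a Delta-backed feature store can look like in PySpark. This is illustrative code, not code from the talk; the table names, columns, and aggregations are assumptions.

```python
# Illustrative sketch only -- table and column names are assumptions,
# not details from the engagement described in the talk.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Production feature engineering job: compute shared summaries once.
transactions = spark.read.table("raw.transactions")
features = (
    transactions.groupBy("member_id")
    .agg(
        F.count("*").alias("txn_count_90d"),
        F.avg("amount").alias("avg_txn_amount_90d"),
    )
    .withColumn("feature_timestamp", F.current_timestamp())
)

# Persist to the feature store; Delta provides ACID writes and time travel.
features.write.format("delta").mode("overwrite").saveAsTable(
    "feature_store.member_features"
)

# Day one for a new model: select from the existing pool of features.
training_df = spark.read.table("feature_store.member_features").select(
    "member_id", "txn_count_90d", "avg_txn_amount_90d"
)
```

Because the features live in one governed table, a new model starts from an existing pool rather than re-deriving the same aggregates, and the engineer has a single place to attach controls.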
One of the design principles our approach to industrialized machine learning is built on is the data architecture principle of the lakehouse. This is basically the idea that, with new emerging technologies, the applicable and viable use cases for my data lake have broadened. Whereas before I might have had a bunch of niche data science use cases that required specialized data solutions, increasingly I can build those solutions directly on the data lake. That includes BI use cases, streaming use cases, and interactive data science development use cases, which can include ad hoc analytics of our pre-aggregated data, while also supporting our production applications, especially through the unified batch and streaming paradigm, which helps me ensure consistency between my training and prediction pipelines.
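One concrete benefit of that unified paradigm: the same Delta table can back a batch read for training and a streaming read for prediction. A brief sketch, reusing the hypothetical feature table from above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Batch read of the feature table for model training.
train_features = spark.read.table("feature_store.member_features")

# Streaming read of the very same table for the prediction pipeline.
live_features = spark.readStream.table("feature_store.member_features")

# Any shared transformation logic applied to both DataFrames keeps the
# training and prediction pipelines consistent by construction.
```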
Another key design concept in our industrialized machine learning deployments is that model management is important, but it’s not the only thing: we want to manage the whole ML pipeline. We make that distinction because some of the governance and tooling that’s required happens at prediction time, through model management, but a lot of the governance we want to bake into our infrastructure happens much earlier, in the training process. So we want the opportunity to productionize those steps of the pipeline rather than relying on ad hoc processes.
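One way to picture “managing the whole pipeline” is each stage as a discrete, independently schedulable and auditable job rather than one monolithic notebook. A hypothetical skeleton; the stage names and signatures are assumptions, meant only to show the shape:

```python
# Hypothetical skeleton -- stage names and signatures are assumptions.
def engineer_features() -> None:
    """Refresh the governed feature tables (its own scheduled job)."""
    ...

def train_model() -> str:
    """Train against the feature store; return an MLflow run ID."""
    ...

def evaluate_model(run_id: str) -> bool:
    """Standardized quality and fairness checks, before promotion."""
    ...

def promote_model(run_id: str) -> None:
    """Register the model and advance its stage if evaluation passed."""
    ...

# Each step above can be deployed, audited, and governed independently,
# rather than living inside a single ad hoc notebook.
def run_pipeline() -> None:
    engineer_features()
    run_id = train_model()
    if evaluate_model(run_id):
        promote_model(run_id)
```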
The way a typical workflow looks in an industrialized machine learning engagement is that it starts with the familiar data science sandbox environment, which is very ad hoc, very experimental, and therefore may not have the same production-quality change controls that you would see closer to your production predictions. But one thing that makes this workflow a little different from what you might find in an organization following more of a proof-of-concept factory approach is that the output of this step is not a model to be put directly into production.
Instead, the outputs are research artifacts, such as notebooks, parameters, and metrics captured in MLflow, which are essentially used as documentation to inform how we want to extend our production applications. Then, through our CI/CD processes, we first have the opportunity to layer on more production-quality standards, such as code quality standards. The output of the CI/CD process, again, is not yet the model, but our production ML pipelines and production jobs, the first of which is the production job for training your model. Here we’re using technologies such as Databricks cluster management, Databricks job management, Delta for serving our production features, and MLflow for implementing the next step, which is model promotion.
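As an illustration of capturing those research artifacts in MLflow, here is a sketch using synthetic stand-in data; the run name, parameters, and metric are illustrative, not details from the engagement.

```python
# Sketch with synthetic stand-in data; names and values are illustrative.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in data; in practice this would come from the feature store.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="personalization-experiment"):
    params = {"n_estimators": 200, "max_depth": 8}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params).fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_metric("val_auc", auc)

    # Log the fitted model so the promotion step can register it later.
    mlflow.sklearn.log_model(model, artifact_path="model")
```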
The output of our productionized training process, which has isolated the concern of training and therefore given us the opportunity to layer on things like responsible AI analytics or standardized model quality evaluation, goes through model promotion. There, for one, the scientist has the opportunity to validate that what they’re seeing in production matches what they saw in step one. The business also has the opportunity, at this model promotion stage, to verify that the performance of the model is aligned with business values for fairness, risk, and model quality.
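A sketch of what that promotion gate can look like with the MLflow Model Registry; the registered model name and the metric threshold are assumptions, and a real promotion would also include the human fairness and business reviews described above.

```python
# Sketch of a promotion gate -- model name and threshold are assumptions.
from mlflow.tracking import MlflowClient

client = MlflowClient()
model_name = "member-personalization"

# Take the newest registered version that is still awaiting review.
candidate = client.get_latest_versions(model_name, stages=["None"])[0]
run = client.get_run(candidate.run_id)

# Gate the stage transition on the standardized evaluation metric.
if run.data.metrics.get("val_auc", 0.0) >= 0.75:
    client.transition_model_version_stage(
        name=model_name, version=candidate.version, stage="Production"
    )
```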
Then finally, we productionize the prediction process, which gives us the opportunity to address concerns such as collecting feedback from production and identifying drift in either the model or our data. Here’s an example of where we’ve put this into practice. We were working with a financial services organization that wanted to personalize the services they were providing for their members. As a not-for-profit organization, they had a very high standard for the fairness of their models, in addition to being subject to regulatory compliance.
We wanted to meet this high standard of model governance without creating new friction for the data scientists. So step one, we deployed the Databricks lakehouse platform, which was able to support both our interactive data science development workflows and our production ML applications. Then we implemented the model deployment workflow we saw on the last slide. This gave us end-to-end opportunities to develop tooling and governance such as the Feature Store, so that with every deployment we were able to codify the lessons learned and bake them into our infrastructure.
What we saw as a result was a significant increase in the rate at which we could deliver models, both in the amount of time it took to develop a model and in the number of models we could deliver concurrently. We also saw significant optimizations to the data science workflow, most notably in the amount of time the scientists were spending data wrangling. They were able to spend much more time on algorithmic design, and on inventing new and creative ways to evaluate the quality and fairness of their models, and less time slicing and dicing the data. So thank you for attending this presentation on how Accenture is using Databricks to industrialize our machine learning applications, and I hope you’ll stick around for Q&A.

Bryan Christian

Bryan Christian is the Enterprise AI & Analytics Lead at Navy Federal Credit Union where he leads data science, MLOps, BI data visualization, and analytics adoption for the enterprise analytics transf...

Nathan Buesgens


Nate is a Data Architecture and ML Engineering consultant at Accenture. He leads the design and technical delivery of complex ML applications. With his background in productionizing research applicati...