Misusing MLflow To Help Deduplicate Data At Scale

May 27, 2021 03:50 PM (PT)


At Intuit, we have a lot of data – and a lot of duplicate data collected over decades. So we built a rule-based, self-serve tool to identify and merge duplicate records. It takes experimentation and iteration to get deduplication just right for 100s of millions of records, and spreadsheet-based tracking just wasn’t enough. We now use MLflow to automatically capture execution notes, rule settings, weights, key validation metrics, etc., all without requiring end-user action. In this talk, we’ll talk about our use case and why MLflow is useful outside its traditional ML Ops use cases.

In this session watch:
Robin Oliva-Kraft, PM, Intuit
Maya Livshits, Data Scientist, Intuit

 

Transcript

Robin Oliva-Kra…: All right. Hello everyone. You’re here to learn about MLflow and analysts, how Intuit misuses MLflow to help deduplicate data at scale. My name is Robin. I’m a product manager, and Maya is a software engineer on my team. And we’re excited to be here to talk to you today. So we’re just going to give a little bit of context about what we do and how it connects to MLflow. I’ll give a little demo. And then Maya is going to dig into the nitty-gritty details about how this works behind the scenes, and then we’ll conclude. So you may have heard of Intuit, TurboTax, QuickBooks, and Mint. Our mission is to power prosperity around the world, and we do that in a couple of different ways. But we think of ourselves as an AI-driven expert platform where we use data and AI to help our customers have more money and do less work, and have total confidence in our products and how they’re using our products.
And yet, there is a problem in our systems. We have millions of duplicate records, and we need to do something about it because it can cause problems in various parts of our business. Duplicate data is a common problem when you have a large business, and Intuit has been around for almost four decades. But in our case, it happens for a couple of reasons that are very specific to our business. So, for example, you might have this company, Imagination Inc, that has a payroll product of ours. They may have signed up separately for a QuickBooks product with a slightly different name. They may not have linked the two accounts together. And then a different version of Imagination Inc is actually a vendor to another company called Cookies Inc, which is one of our customers.
So the same Imagination Inc can appear in multiple ways, in multiple systems, for perfectly reasonable reasons. And we can’t necessarily connect them directly through database IDs, the way you would ideally be able to. Similarly, for people, you might have a Mint user who is also a payroll user or an employee of a company that uses a payroll service that we provide. You might have a company owner who is also a Credit Karma member, and duplicates just accumulate. It’s a natural thing. It’s unfortunate, but it’s common. So Maya and I work on a tool to help solve that problem. We call it Entity Resolution, where we are trying to generate a unique representation of a real-world thing: say a person, a company, a document, a transaction, something like that.
And so we do this through essentially three steps in our tool. We do feature generation, which is essentially just pre-processing data at runtime in ways that will help us match it. So you might lowercase every record if you’re going to do direct string matching. In the matching stage, we identify records that are related to the same real thing, whether you’re using string matching or fuzzy matching, and I think we’ve explored using Soundex to match. And then, once you have all potential matches, you need some mechanism for mastering the records together. And so that’s collapsing them into a single record that is one unique representation of that thing. So you have one canonical record of a company or a person or a document or transaction. So the use case for this initially was we essentially want to keep our customers and customer success agents happy and secure. So we want to make sure that our agents can quickly and accurately authenticate callers who are calling them because they want to solve some problem with QuickBooks or Mint or something.
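As an aside on the three steps just described, here is a minimal Python sketch of feature generation, matching, and mastering. It is not Intuit's implementation: the sample records, the field names, the exact-match rule, and the "first non-empty value wins" merge rule are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "name": "Imagination Inc", "phone": "555-0100"},
    {"id": 2, "name": "IMAGINATION INC", "phone": None},
    {"id": 3, "name": "Cookies Inc", "phone": "555-0199"},
]


def generate_features(record):
    """Feature generation: add a normalized key used only for matching."""
    return {**record, "name_key": record["name"].strip().lower()}


def match(featured):
    """Matching: group records whose normalized names are identical."""
    groups = defaultdict(list)
    for rec in featured:
        groups[rec["name_key"]].append(rec)
    return list(groups.values())


def master(group):
    """Mastering: collapse a group into one record, keeping the first
    non-empty value seen for each field (an assumed merge rule)."""
    merged = {"source_ids": [rec["id"] for rec in group]}
    for field in ("name", "phone"):
        merged[field] = next((rec[field] for rec in group if rec[field]), None)
    return merged


featured = [generate_features(r) for r in records]
mastered = [master(g) for g in match(featured)]
```

With these three inputs, the two spellings of Imagination Inc collapse into one mastered record and Cookies Inc stays on its own. A real configuration would swap the exact-match rule for fuzzy or phonetic matching.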
They don’t want to spend 10 minutes trying to figure out which John Smith is calling, whether they have the right access, and to which accounts. It’s just easier and quicker when there are no duplicates in the data. And so the tool that I’m going to be showing has made a significant impact in improving the quality of the customer service experience for our users. So, where does MLflow come into this? The three-step process that I talked about gives people a number of different knobs and dials to tweak a configuration for Entity Resolution. If you’re collaborating or making changes over time as a data analyst, you’re trying to be productive with this tool. But it can be really hard to build on past work because people track quality metrics manually, if at all, and data is missing. And so that can make you concerned that your output maybe isn’t what it could be. It’s not accurate, or you might just not be aware of the quality of the output that you’re producing.
And one thing to keep in mind when we talk about this is that our user base is analysts, and they are not a typical MLflow target audience. In addition, using Entity Resolution is not necessarily their day job. They have other responsibilities. They have SQL skills, maybe some Python skills. They’re not software engineers, data engineers, ML engineers, or data scientists. And yet, we have a self-serve UI for them so they can do this work. Excuse me. So before we started using MLflow, the analysts would have to take information that’s captured in notebooks, stored on S3, and in messages in Databricks, basically pull it all together in a spreadsheet, and make sure that they’re good about typing the exact right numbers and recording every experiment that they do, with all the different knobs that they can change to improve the quality of their results.
But that ad hoc manual tracking is problematic in a number of different ways. It’s necessarily incomplete: even if you are tracking, not every experiment or metric gets captured. We saw that in practice. It’s inconsistent because you have different people working on an Entity Resolution project, and different people track the relevant information in different ways. And people working across different projects are also going to track information in different ways. It’s a cumbersome process: since you have four sources of information, you have to go out and find all of it, so you’re less likely to do it. And then, on top of that, it’s not really discoverable. New users that come into our platform have to create their own spreadsheet in their own way and figure out what metrics they want to capture.
So all in all, it’s just not a great experience. And because of that, it’s hard to improve, because you don’t have a clear record of what you’ve tried. With automated tracking using MLflow, the data is complete and consistent. It’s effortless to capture, and it’s discoverable because it’s built into the tool that we built. And so every job is tracked in the same way, without human intervention, in MLflow, which we link through to directly from our Entity Resolution product. So you can have a nice graph like the one we see here on the slide that shows false positives versus false negatives. And it’s just nice to be able to see how this is changing over time.
So what part of MLflow are we using? We’re just using the tracking portion. We don’t use any of the rest, although I believe there are teams that do use other parts. We just need the tracking, and it’s been a great piece of software for us. And so now I want to show you how this works in practice. So in this demo, I’m going to show you how all this fits together end to end. So this is the curation platform. We’re going to look at this test called in basic entity. This takes a test tagging log entity as input. We run Entity Resolution to deduplicate it, and it produces this entity as an output. So this is the basic workflow for this entity pipeline, for this entity rather. So this is the Entity Resolution UI. This corresponds to the three steps that I talked about earlier. So there’s feature generation, where you can extract information from fields or change fields at runtime to simplify processing. In this case, I’m making the ID field lowercase.
The matching step lets you configure how you want matching to happen, so you can match on multiple fields and give them different weights so that you can control, essentially, false positives and false negatives. And then, in the mastering section, you can define how you want to merge records together and the formatting algorithm that gives you the output that you want. So we’re going to just make a simple change here. So this is a run on a sample of data. Save that and then do a dry run, which is essentially running this in a sandbox environment. So we’re not actually producing any data to data consumers. So it was 211, changed napping SQL. So that’s what I did. This is helpful for my future self, wondering six months from now what I did to manipulate this configuration. So we can keep an eye on this run. It hasn’t quite started yet. So here… There we go.
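To make the idea of per-field weights concrete, here is a small hedged sketch of weighted multi-field matching. The weights, the threshold, the field names, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions; the talk does not say how the tool scores a pair of records.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold; the tool's real settings are not shown in the talk.
WEIGHTS = {"name": 0.6, "email": 0.4}
THRESHOLD = 0.85


def similarity(a, b):
    """Fuzzy similarity in [0, 1]; a missing value never matches."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(left, right, weights=WEIGHTS):
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    weighted = sum(
        w * similarity(left.get(field), right.get(field))
        for field, w in weights.items()
    )
    return weighted / total


def is_match(left, right, threshold=THRESHOLD):
    """Two records are treated as duplicates when the score clears the threshold."""
    return match_score(left, right) >= threshold
```

In a scheme like this, raising the threshold or shifting weight toward a stricter field trades false positives for false negatives, which is the kind of knob the analysts are tuning between runs.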
Here it is. It’s running now. This is going to take a little while. So I’m not going to wait for this, but in Databricks, in the MLflow UI, we can see this run pop up in the experiment history. So here it is. And there’s not much to see here, but I want to show that the information that I captured early on in this process appears in the MLflow record. So we can see the summary and the description of the changes. And we can see that this is a dry run.
So I want to go back to the runs page. And if we look at one of these that succeeded, we have this option to open a new notebook. So I’m going to do that. And this is essentially letting me, oh, it takes me to a notebook that has a bunch of [inaudible] functions that help me basically interact with the MLflow data and the data that we’re producing. So I run these utility functions, and I can see that there are a number of MLflow specific functions. So I can insert a new metric column, and I can log metrics. So this is typically used for logging things like accuracy metrics that might be calculated externally, any type of notes that you might want to add, stuff like that. It’s the interface for our users to essentially add to their lab notebook in MLflow.
So what I’m going to do here is for the run that we were looking at earlier, which corresponds to this guy here. And so the reduction rate here was 100% because it went from 200,000 records to 100. That seems unlikely, but we’ll just say for the sake of the demo, we’re going to accept that. So what I can do here is, just using our helper function, tell the function which entity I’m looking for and which run ID. And this is the run ID that corresponds to the run ID in our platform, not the MLflow run ID; they’re different. And then which field I want to change, and the value. And since it already has a value, I’m going to have override equals true. So here we go. Pretty straightforward.
If we refresh here, we will see the value over here in the corner now, 0.15. I can also add new metrics. I don’t want to do that because we found out the hard way that once you add a metric, you can’t remove it. So even though this is test data in a staging environment, I don’t want to mess it up too much. And just for the sake of cleanliness, I’m going to change the value back to the original value. So we should be back to where we started. There we go. Reduction rate is one, as originally intended.
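The notebook helper itself is not shown in the talk, so the following is only a guess at its shape: a function that refuses to touch an existing metric unless `override=True` is passed. The name `set_metric` and the `InMemoryClient` stand-in are hypothetical. The two client methods it relies on, `get_run(run_id).data.metrics` and `log_metric(run_id, key, value)`, do exist on `mlflow.tracking.MlflowClient`. The translation from the platform's run ID to the MLflow run ID is left out here.

```python
from types import SimpleNamespace


def set_metric(client, run_id, key, value, override=False):
    """Log `value` under `key`, refusing to replace an existing metric
    unless `override` is True.

    `client` is expected to behave like mlflow.tracking.MlflowClient.
    """
    existing = client.get_run(run_id).data.metrics
    if key in existing and not override:
        raise ValueError(
            f"{key!r} is already {existing[key]} on run {run_id}; "
            "pass override=True to replace it"
        )
    client.log_metric(run_id, key, float(value))
    return value


class InMemoryClient:
    """Stand-in for MlflowClient so the sketch runs without a tracking server."""

    def __init__(self, metrics):
        self._metrics = dict(metrics)

    def get_run(self, run_id):
        return SimpleNamespace(data=SimpleNamespace(metrics=dict(self._metrics)))

    def log_metric(self, run_id, key, value):
        self._metrics[key] = value


client = InMemoryClient({"reduction_rate": 1.0})
set_metric(client, "demo-run", "reduction_rate", 0.15, override=True)
latest = client.get_run("demo-run").data.metrics["reduction_rate"]
```

Against a real tracking server you would pass `MlflowClient()` instead of the stand-in. MLflow keeps a history of metric values, so an "override" of this kind records a newer value rather than erasing the old one.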
So the last thing that I want to show is just why this is really valuable. So I guess you can already see the graph. I don’t need to run this, but essentially I’ve queried the data for specific information that is interesting to me. I can run this function in this cell to see a nice graph of basically the validation metrics that I care about over time. So I’m looking at false negatives in blue, and that’s going down. False positives are orange, and they’re pretty steady around 1% or less. And the reduction rate, which is the compression of the dataset by merging duplicate records together, is around 20%, 21%. What I think is interesting about this is that we can see that in the early stages of developing this particular configuration, there was a lot of variation as they were honing it. And then, over the last year, it’s been pretty steady. We haven’t really changed much.
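The talk describes the reduction rate only informally, as the compression of the dataset from merging duplicates. One plausible reading, assumed here, is the fraction of input records that disappear after mastering:

```python
def reduction_rate(input_count, output_count):
    """Fraction of records removed by merging duplicates (assumed definition)."""
    if input_count <= 0:
        raise ValueError("input_count must be positive")
    return (input_count - output_count) / input_count


# The demo run went from 200,000 records to 100.
demo_rate = reduction_rate(200_000, 100)
```

Under this definition the demo run comes out at 0.9995, which would display as 100% once rounded, and a dataset that shrinks from 100 records to 79 has a rate of 0.21.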
The fact that we can capture all this information in MLflow is super convenient. We knew that we needed to capture this information somehow, but we didn’t want to. When we started thinking about how we would do this, it was: we need a database, and we need some way to have users interact with the database or have our systems interact with it. And then we’re like, oh, that’s basically what MLflow is. So why don’t we just use that? And this gives our analyst users a really convenient way to store information about the experiments they’re doing with very little work on their part. And they only really need to add anything themselves if they have to insert some custom metrics that we can’t capture automatically. And so with that, I want to hand it over to my colleague, Maya, to talk about how this works behind the scenes and the architecture for how we do it, because it seems like it might be a little bit unlike how you would typically use MLflow. So thanks very much. Maya, over to you.

Maya Livshits: Well, after Robin explained why we’re using MLflow and what the outcome is, I’m going to talk a bit about the technical details. So this is a representation of our architecture. We have three different parts of the Entity Resolution product: the Node.js app, a Scala Spark app, and a Python Databricks notebook. We need to call MLflow tracking from each of them. This is not the usual architecture that people use for MLflow. So as I said, we have three different APIs. For the Node.js app, we’re using the REST API. The Node.js app is the configure API. It’s used to start the configuration. The Scala Spark app does the actual Entity Resolution work, and there we’re using the MLflow Java API. And the Python Databricks notebook we’re using for helper functions for post-processing, and there we’re using the MLflow Python API. And it’s really handy that we have these options for a very nice integration, although having different APIs means different implementations.
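To give a feel for the REST side of this, here is a hedged sketch of the request bodies a service could send to MLflow tracking. It is written in Python rather than Node.js and only builds the path and JSON; it does not send anything. The `runs/create` and `runs/log-metric` endpoints and the `mlflow.runName` and `mlflow.note.content` tags are part of MLflow. The function names, the `dry_run` tag, and the way the summary and description are mapped onto tags are assumptions.

```python
import json
import time

API_PREFIX = "/api/2.0/mlflow"


def create_run_request(experiment_id, summary, description, dry_run):
    """Build the POST that opens a run and attaches the analyst's notes."""
    body = {
        "experiment_id": experiment_id,
        "start_time": int(time.time() * 1000),  # milliseconds since epoch
        "tags": [
            {"key": "mlflow.runName", "value": summary},
            {"key": "mlflow.note.content", "value": description},
            # Hypothetical custom tag marking sandbox runs.
            {"key": "dry_run", "value": str(dry_run).lower()},
        ],
    }
    return f"{API_PREFIX}/runs/create", json.dumps(body)


def log_metric_request(run_id, key, value):
    """Build the POST that records one metric value on an existing run."""
    body = {
        "run_id": run_id,
        "key": key,
        "value": value,
        "timestamp": int(time.time() * 1000),
    }
    return f"{API_PREFIX}/runs/log-metric", json.dumps(body)
```

The response to `runs/create` contains the new run's ID, which is the value the other components need in order to write to the same run.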
And this is one problem we encountered. For example, the start run method in the Python API works differently in the Java API. So we solved this by using a different class in the Java API, which has similar behavior to the one in the Python API. MLflow tracking can track different data types, which was necessary for our project. We track parameters, artifacts, and metrics from different parts of the project. From the Entity Resolution Spark app, we’re calling the MLflow Java API twice: once at the beginning of the run and once at the end of the run.
Now, if we’re calling MLflow tracking from three different apps, we need a way to have all the results in the same row in the table, so the data analyst can see all the results in one place. So from the ER configure API, we’re sending the MLflow run ID to the Entity Resolution Spark app, and in the Databricks notebooks, we use the job ID to convert it to the MLflow run ID, and that’s what we use to add to the same record for each run. And this is how it looks in the end. Everything is in one place, in one row. The user has all their information stored in one place. And back to Robin.
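How the notebook converts a job ID into an MLflow run ID is not spelled out in the talk. One way it could work, assumed here, is that the run is tagged with the platform's job ID when it is created, and the notebook searches for that tag. The function name and the `er_job_id` tag key are hypothetical; `search_runs` with `experiment_ids`, `filter_string`, and `max_results` is a real `MlflowClient` method, and the `tags.<key> = '<value>'` filter syntax is standard MLflow.

```python
def find_mlflow_run_id(client, experiment_id, job_id, tag_key="er_job_id"):
    """Return the MLflow run ID of the run tagged with the given platform job ID.

    `client` is expected to behave like mlflow.tracking.MlflowClient.
    """
    runs = client.search_runs(
        experiment_ids=[experiment_id],
        filter_string=f"tags.{tag_key} = '{job_id}'",
        max_results=1,
    )
    if not runs:
        raise LookupError(f"no MLflow run tagged {tag_key}={job_id}")
    return runs[0].info.run_id
```

Once every component resolves to the same run ID, parameters logged by the configure API, metrics logged by the Spark app, and notes added from the notebook all land on one run, which is what puts everything in a single row of the experiment table.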

Robin Oliva-Kra…: All right. Thanks, Maya. So to conclude, I can use a user testimonial: “Auto-tracking with MLflow makes historical comparison so much easier.” And we can see more of our users in the corner, very excited about being able to use the data captured in MLflow to talk about how that process works and how different configurations can make a real difference in the output that you get that may be more appropriate for different types of use cases versus others. And so, we can see a real site where they used the MLflow UI directly to explain the trade-offs between different types of configurations for the Entity Resolution process. And so with that, we want to open it up for Q and A. I’m Robin. This is Maya. We’re at Intuit. And please don’t forget to send feedback. Any questions?

Robin Oliva-Kraft

Robin is a PM at Intuit, makers of TurboTax, Mint, and QuickBooks. His team is helping to build a data mesh to organize and make accessible data that has been collected over 4 decades across multiple ...

Maya Livshits

Maya is a software engineer at Intuit; her team focuses on big data solutions. Maya enjoys coding in various languages including Python, Scala, Node.js and GoLang. Prior to software engineering Maya h...