Scaling Quantitative Research on Sensitive Data

WorldQuant Predictive is a data science company that leverages proven machine learning, AI, and quantitative finance approaches to solve new business challenges across a variety of industries, such as healthcare, retail, CPG, and others. The primary business objective is to enable customers to get to predictions and insights faster, while reducing the barrier to entry from a cost and talent perspective. To deliver on this, our data platform requires scaling the analytics workflow across a pre-ingested data catalog of thousands of sources and a pre-built catalog of hundreds of models, and keeping a global research team constantly engaged in finding new models, data sources, and approaches for modeling business decisions. Managing this data at scale is a challenge that only grows when you add the words "safely and confidentially." It's critical to ensure that confidential data is protected with automated access controls that maximize the number of hands and minimize the number of eyes working on a specific prediction product. Further, it's necessary to provide transparency and trust through detailed audits of data use for customers. Protecting data access across this environment has become a critical bottleneck, and this session will discuss the approaches we are taking to scale our quantitative research efforts in this complex, sensitive data environment.


 


Video Transcript

– Hi everyone at Spark Summit, welcome to this session by WorldQuant Predictive. We'll be talking about how we use Databricks and Immuta in our quest to create one of the world's largest libraries of predictive models. So what are we gonna be talking about today? Well, I'll talk a little bit about WorldQuant Predictive, just so you understand the context we're coming from, as well as the challenges we face in scaling quantitative research on sensitive data. Then we'll talk about what our platform, especially the data portion of it, looks like, and how we're using Databricks and Immuta to solve the challenges of delivering large amounts of data safely to a diverse and distributed group of researchers.

Well, let's begin. Who are we? We were started by a man named Igor Tulchinsky, who founded WorldQuant Asset Management, a very successful quantitative hedge fund, and we are a group of people tasked with bringing the techniques and ideas that this quantitative hedge fund honed to industries outside of finance. In short, we frame business problems as prediction problems and find ways of making the predictions. We use a large set of tools for it, and Databricks and Immuta are two of our best. Ultimately, what we want to do is make every decision a business makes a back-tested, analytical decision, insofar as that's possible. How do we do that? Well, our main idea is that every person, no matter how smart and good they are, only has a couple of ways of solving any specific problem. So we've gone out and built a large group of global researchers, some of them here in the United States, some of them all over the world, to create hypotheses for solving our clients' problems and building up our model library. We call it the idea arbitrage principle, meaning that there are lots of ideas out there, and if we gather a lot of them, we'll find a large number of really, really good ones. To make this all digestible, we take the ideas people have submitted to us, pick the best ones through automated testing, and ensemble them into a predictive product. We then build the appropriate way of interacting with it, from APIs to extracts to custom tools for scenario planning, simulation, and prediction. The goal for us is to improve business decisions through predictive modeling and increase return on investment for our customers.

Okay, so what are some of the challenges for predictive modeling at scale? Let’s talk about that so we set the context correctly.

Limiting Factors for Scale: What stops you from trying lots of ideas?

Well, there are lots of different challenges, but four of the main ones stand out. The first is MLOps; I'm sure you've heard this phrase, and sometimes you'll hear AIOps, but ultimately it's everything we have to do to turn lots of hypotheses into models. That means running lots of tests, versioning, dependencies, refreshes, retraining, all sorts of things. We'll talk about that very briefly just to set the stage. What else do you need? Well, you need a lot of hypotheses. Hypotheses are really what drive the creation of predictive models. "I wonder if?" Well, that's a hypothesis. We'll talk about that briefly as well, but the bulk of our time here today will be spent on data and on privacy and governance. Data is literally the fuel for both hypotheses and testing. And for privacy and governance: how do we actually make sure that the data our researchers see is what they should be seeing, what is helpful to them, and what is allowed by our various contracts with vendors, customers, public agencies, and the like?

Approach: Predictive Models at Scale

Here's an example of a couple of projects we're working on. Not surprisingly, lots of people are working on helping organizations and businesses deal better with COVID. The things we work on can be as different as working with a public agency to help understand which of their customers, users, or taxpayers are going to be using their services, or working with commercial entities to understand when they can reopen their stores or how they should be stocking their products. And what is it that we need to do here? First of all, we need to work with data. So we scour the world, as it were, for different data sources that we provide access to (and we'll be talking about that a lot more later) for our Global Research Network. Our researchers, as I mentioned, are all over the world, and we deliberately try to focus on picking people who come from diverse backgrounds. Where you grew up, where you went to school, what your specialty, major, or PhD was in, all of that provides you with unique life and analytical experience and toolkits. And all of that is what we want to use to generate interesting hypotheses for our clients. One of the most obvious things when predicting COVID effects is, for example, to look at a lot of public data, and there's a ton of it; I'll be showing some examples later. We also use existing epidemiological models as well as published COVID models. So you'll see, where we talk about SIR, that one of the things we had to do was build tools for ingesting existing models, or outputs from existing models, when we have the rights to do that. We look at different demographic information, various macro and micro KPIs from government agencies, and other information, often coming directly from our clients or bought from our commercial partners. All of this goes into a sort of unholy melting pot. Different researchers may have specialties in different types of machine learning, statistics, and data science. Often they're subject matter experts in things like epidemiology or inventory management. We try to throw as many different approaches at problems as possible. Sometimes an approach crosses over really well from one type of industry to another; sometimes it requires a specialist approach. But overall, what we're really looking for is how many strong, uncorrelated hypotheses we can generate, and we'll talk about how we make sure of that in the next slide. Then, once we find some strong candidates, we ensemble them together and, as one of the previous slides said, roll them up so you have a strong, resilient model. Resilient meaning something that will withstand shocks: losses of data, large changes in the historical record going forward. Which is something that lots of people are experiencing right now. You may be in a business where historical data predicted very, very well what was going to happen tomorrow; then a pandemic happens, admittedly somewhat unprecedented, and suddenly your historical data is no guide to you at all. Part of what we strive to do is build models that are resilient and, in fact, build and identify different drivers for common business questions that we can add to the models many of our customers already have that are based on their historical records.

Approach: Mass Ensembling. Dozens of researchers create hundreds of models, which we combine into a single resilient ensemble.

Let's look in a little more detail at what happens if we sort of double-click on how we get to a good, resilient model. You will notice two things. One, it starts with data again. That's why I'll be talking about data and privacy and governance for pretty much the rest of the conversation today. Our researchers need to access it; that's kind of the point, right? Once they do, they create hypotheses and build some models. We have these models run the gauntlet. Some of you may have a similar setup, some of you may still be working on it, but one of the things we need to check isn't simply the performance on the key metric, accuracy or recall. We also wanna see how their hypotheses and models behave in cases where you have malformed input or extreme values. And then finally, we want to compare them to each other so that we don't end up with a lot of correlated models. We want models that change independently of each other, maybe because they use different data sources or different assumptions. That really is what creates a resilient ensemble. And then, of course, we have to keep doing this as the data updates. That's essentially how we do MLOps, or AIOps.
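To make the "keep only uncorrelated candidates" step concrete, here is a minimal Python (pandas) sketch of one way such a filter could work. The DataFrame of out-of-sample predictions, the model names, and the 0.9 threshold are hypothetical illustrations, not WorldQuant Predictive's actual implementation.

```python
import pandas as pd

def select_uncorrelated(predictions: pd.DataFrame, threshold: float = 0.9) -> list:
    """Greedily keep models whose out-of-sample predictions are not highly
    correlated with any model already kept.

    predictions: one column per candidate model, one row per test observation.
    """
    corr = predictions.corr().abs()
    selected = []
    for model in predictions.columns:
        # Keep this model only if it stays below the correlation threshold
        # against everything already accepted into the ensemble.
        if all(corr.loc[model, kept] < threshold for kept in selected):
            selected.append(model)
    return selected

# Example usage (hypothetical data):
# keepers = select_uncorrelated(oos_predictions)
# ensemble_prediction = oos_predictions[keepers].mean(axis=1)
```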

The main thing you'll see is that pretty much half of the conversation is about data and researchers: how researchers get access to data, and specifically to the right data. So let's move to that. Let's see what our platform is made of.

I used to call this slide "comfortably familiar, refreshingly different." Why? Well, pretty much everybody has a platform that logically looks like this. There is data somewhere; it comes into landing zones and it gets processed. Maybe the alpha and meta-alpha zones, which are our words for models and ensembles, are a little bit different from what you're used to. And then eventually, the models have to make it to the production and operations zones. The two things I wanna point out that make it a little bit harder for us at scale are that, even for not overly onerous problems, we may end up with hundreds or thousands of models and dozens or sometimes hundreds of data sources. So the two KPIs that we track from an engineering perspective, and really aspire to, are speed to prediction and cost per prediction. Speed to prediction is important to us because that's really what we're trying to deliver to our clients: how fast can we go from asking a business question, or formulating it, to having a decent model? That's what a lot of our custom tooling is designed to facilitate, and that's what a lot of our purchased tools, like Databricks and Immuta, are designed to facilitate. Cost per prediction is really all of the resources, human and machine, that we have to use to get to this prediction. And that's, again, where Databricks specifically was one of our choices, and I'll talk about that in a minute. For those of you who have gone through a Databricks PoC, you've probably noticed that they're very, very big on asking clients to figure out what their ROI is going to be, so they do not feel like they were cheated later. So what we see here is that Databricks plays a pretty big role in our platform. That's the green squares that you see; you see them all over the place, sometimes because they are repeated components in different parts of our environment. But the main ones we're gonna be talking about today are the squares on the bottom left, where we talk about data packaging and data exploration, as well as data validation right above them. We use a few different data catalogs, and I'll say that for data exploration we actually really like the Databricks one; for some of our other purposes, we use Immuta and custom tooling. So you'll notice that we have a lot of Databricks pieces, a few custom pieces, the MLOps ones that I've already shared with you, and Airflow in yellow. We'll be talking about how all of this fits together towards the end of the presentation, when we walk through a specific example of how we let researchers safely and securely add data to our platform for themselves and others. But you can see that Databricks is something we've chosen to enable better cost and better speed to insight for our researchers. So hopefully it's the same for you. Let's talk about data for a second.

Predicting Economic Disruption: Data

Here are some of the buckets (not too many) of data sources for a typical COVID-19 prediction.

You have to collect a lot of data to be able to answer almost any interesting question, right? Each one of these buckets probably has 10 to 50 data sources for us. So that's a pretty sizable thing, and I'm not gonna spend a lot of time on this slide, but the idea is that you very quickly get to a meaningful data catalog, where anything and everything has to be able to be joined together and quickly made available to researchers.

Predicting Data Categories & macro-economic indicators

And then if we look at just COVID-19 spread

Predicting COVID-19 spread: Data

you can again quickly get to 10, 20, 30 catalogs

and many of them have multiple tables, and it's the same thing, right? What's interesting here is that, even though not all of the data sets are large, some of them are, and as soon as you have a large catalog, that's where Spark comes in. It's very difficult to join a small table, a few tens or hundreds of thousands of rows of data, to something that has billions of rows; that's just hard to do in traditional environments. So for us, speed is very important, and the flexibility of Databricks was a really big deal. I'll talk about that in a minute, but this ability to quickly join across types of data, regardless of where the data is, was also a really, really big deal.
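As an illustration of the kind of join just described, here is a minimal PySpark sketch as it might run in a Databricks notebook (where `spark` is the provided SparkSession). The table and column names are hypothetical, and broadcasting the small side is just one common way to make a small-table-to-huge-table join cheap.

```python
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

# Hypothetical tables: a small reference table (tens of thousands of rows)
# and a very large fact table (billions of rows).
county_stats = spark.table("research.county_demographics")
mobility_events = spark.table("research.mobility_events")

# Broadcasting the small table to every executor avoids shuffling the big one.
joined = mobility_events.join(broadcast(county_stats), on="county_fips", how="left")

joined.groupBy("state").agg(F.avg("daily_visits").alias("avg_daily_visits")).show()
```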

Reducing Scale Limits: What stops you from trying lots of ideas?

So how do we reduce these limits that we've been talking about? We've talked about MLOps; that's our blue stuff, we've seen a little bit of it, and we'll see a little bit more in a second. For hypotheses, we have a tool called Quanto that our researchers can use, as well as all of their standard data science and machine learning tools. Ultimately, hypotheses are generated by people, so the more people we have and the better tooling they have, the more hypotheses we get. For data, essentially, we're using Databricks with an object storage back end. And for privacy, we're using Immuta. So if we look back at the original slide, we now see that we pretty much have a tool for everything, and from our perspective, the fewer tools we have to develop, the better off we are. So I just wanna take like 30 seconds to talk about the custom tooling.

Custom Tooling for MLOps

Our challenge is that we have almost 100 researchers now and are growing fast, and we need them to work on multiple problems and work on them together. We've looked at a lot of tooling and found that it was most cost-efficient for us to build some of our own tooling on top of a lot of open source pieces. For example, the way we provide feedback to our researchers is to give them letter grades as opposed to actual scores, because we don't want them to overfit. We manage playbooks, which are our scripts; there are lots of things. You need to look at the requirements of your own data science team to see if that's worthwhile. Databricks, of course, comes with MLflow, which is a very, very good solution for helping researchers go through this MLOps, AIOps lifecycle without developing anything custom. So I just wanted to mention that we have our own requirements that make it difficult for us to use MLflow and other frameworks; that's where we are right now, but your mileage may vary, and unless you're dealing with dozens or hundreds of researchers working on a project, it's probably a really good idea to look at MLflow and similar tools.
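As a toy illustration of the letter-grade feedback idea mentioned above, here is a minimal Python sketch; the metric range and bucket boundaries are hypothetical, not the actual grading scheme.

```python
def letter_grade(score: float) -> str:
    """Map a raw evaluation score in [0, 1] to a coarse letter grade.

    Researchers only see the grade, not the underlying score, which makes it
    harder to overfit a model to the exact evaluation metric.
    """
    if score >= 0.90:
        return "A"
    if score >= 0.80:
        return "B"
    if score >= 0.70:
        return "C"
    return "D"

# Example: letter_grade(0.87) returns "B"
```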

Data Access, Exploration, & Sharing

So how did we even end up with Databricks? Many of you are probably already Databricks and Spark customers, but for those of you who are still thinking about it, let me share a story. On the right of this slide, you'll see a bunch of tables. This is a fraction of the different requirements that we identified; each column is a different vendor, and we have a few columns that didn't fit in here. So we looked at a lot of different things. The main thing we wanted to look at was, how can we allow researchers to add their own data sources? We want them to generate hypotheses, and we want them to be able to say, "Hmm, I just realized that many of the airports in the United States publish their wait times, how long it takes to get through the checkpoints, online. Hey, I could get this data source, and maybe it tells me something about macro mobility in the United States during the pandemic." We have pretty good tooling for adding data sources, but it's nice to be able to do this yourself. We wanted to see how researchers can combine data sources and query them. Can they do all of this quickly? And what is this going to cost us, right? What is it going to cost us when nothing is happening? So we evaluated a lot of different tools and really felt that Databricks was the right one for us.

Practical points to consider: Tons of requirements, but in the end it comes down to… more hypotheses explored means a better predictive product.

For one, we really liked Spark. We liked the idea that we can mix exploratory and operational workloads. We liked the scale, and we liked the ease with which data sources can be added, managed, and virtualized.

And then, when we looked at that and we looked at a few different vendors, we felt like Databricks was a really good partner for us. I will also add, I'm sure the Databricks sales team won't (mumbles), but they are suckers for helping people figure out how to use Databricks when you get started. So if you get the right sales team, they will really help you get pretty far along in your process if you aren't already. And some of the decision points that we had to think about were: look, for us, more hypotheses explored means a better predictive product. So the tool we're looking for is something that's going to allow our researchers to either generate or test as many hypotheses as possible in the course of a day or a week. For you, you're gonna have to think about things like: how much am I doing operationally in Spark versus exploratory work? What I mean by that is that lots of organizations use Spark because their data volumes are just so large that it's the only way to practically do what they need to do. That's an operational use case. If you're doing exploratory work, you may find, like we do, that you do a lot of exploratory stuff and want to give researchers access to many data sources. So for us, for example, notebooks were kind of a big deal, and the ability to do things in Python or SQL was kind of a big deal for us, right? Costs are different, so you're gonna have to figure this out, but those are some of the things I would want people to consider. Think of what else you have to integrate it with. We have the luxury of building things from the bottom up, so we integrate with whatever we want. That really helps us, and we found that Databricks was really, really strong as a managed deployment.

And it was very easy for us to integrate it with our existing tooling.

Are you looking to repurpose existing dollars? A lot of tools are basically, "Hey, you're spending a ton of money on X; we're like X on steroids but at half the price." In our case, we were spending fresh dollars, so we really wanted to make sure that we get additional value. And then finally, it's how big and advanced your team is. This isn't so much a decision point from the Databricks perspective, but a lot of the niche use cases that we have to deal with come partly from having a lot of researchers; we run a lot of different types of clusters for different kinds of predictive problems. Alright, so those are some of the decision points that you are gonna be making. And even if you have chosen Databricks or some other Spark implementation, those are decision points that you're gonna be making over and over again, as the organization changes, as the business changes, as needs change.

Protecting Client Privacy Challenges: A lot of juggling

So let's talk about client privacy challenges. Here are five different balls that we're gonna have to keep in the air at all times. We need to share data with lots of researchers all over the place, and there are lots of restrictions around that.

Some are easier to deal with. For example, in our case, we've just made a decision that all of our researchers use virtual machines that are based in the United States and they can't download things. So yay! But you're still sharing data with a lot of people, and in our case, it can be a lot of people, so we want to know who it is and we need to know where they are. We want to generate tons of hypotheses from lots of data sources, so we don't want to control access to data so tightly that people can't do that. We need to track things because we have contractual obligations both to clients and to the vendors we license data from, and we kind of have to pick and choose a little bit whom we trust and how. And of course, did I mention speed is important? We need to get it done fast. So how do we deal with that?

Yay, colors. So all of this is really shaded, right? I picked primary colors for these things, but realistically there's a little bit of everything in each of them. But for the purposes of this talk, let's say that for things like trust and access we work with Immuta, and for things like data and speed we work with Databricks. And of course, I'd be remiss if I didn't say it, one of the biggest things for us when picking Databricks and Immuta was that they work well together. For us, the ability to deploy Immuta with Databricks in a couple of days for a proof of concept was a pretty big selling point. So let's actually look more closely at how Databricks and Immuta work for us.

Databricks solves for data for us, right?

Databricks solves for D(ata)

When we actually worked with the sales and sales engineering team, like many of you probably did, we had our requirements document, and it really came down to things like understanding how query performance compares to other options out there. Right, and what would we have to do? How much work would we have to do to get comparable performance? I've spent many years working with relational databases, and I'm sure many of you know somebody in your organization who can tune any kind of MySQL or PostgreSQL query; the question is how much effort it takes. In our environment, where we have a lot of data sources and a lot of different types of joins, spending a lot of time on any particular query isn't ideal, right? And we certainly don't want researchers to do that; that's more of an operational type of optimization. So we needed something that would work really well out of the box, and we found that Databricks did really, really well, especially once we figured out how to do our cluster management. One of my personal goals is that we want researchers to be able to submit models into our platform within the first day of being onboarded, and if Databricks becomes part of that, then they have to be able to figure out how to use Databricks in some basic way pretty quickly. We found that Databricks did onboarding really, really well, and support for multiple languages is a big part of it. I know a lot of it is Spark stuff, but in our case, we feel like we pay for management and tooling on top of a really strong Spark implementation.

On security, we use Databricks Enterprise combined with Immuta, and one of the things that we measure is the amount of time between a data source being identified (let's just make it a public data source) and it being available to researchers. That's how long it takes for us to go through the cycle, and we want tooling that allows us to do it fast, and customization that allows us to do it even faster.

Supporting ad-hoc dataset ingestion

I wanna bring up an example of something we did that I think was pretty cool and that fits well into what we've been talking about so far.

Think of what it takes to support ad hoc data set ingestion. It's a use case that isn't particularly important operationally; most of you who care about it from an operations perspective would say, "But wait, we know where all of our data is coming from; we have these feeds, these terabytes of data." But again, if you're looking at saying, "I want researchers to come up with fresh ideas, and they're gonna constantly be playing around with different data sets," you'll want to allow them to basically say, "Hey, I found something, how do I add it?" Great, lots of ways to do this, except... and we'll talk about the "except" in a moment. Databricks supports bring-your-own data sets, but it feels a little bit tricky to just allow people to upload anything they want into our environment. It's okay to upload it, but do we really want it to be available? Whom does it become available to? So the process we've come up with is to use Airflow, which we use for orchestration of everything related to data, plus your favorite object or data store, and we use Databricks notebooks and dashboards to create a light workflow that lets researchers add the data while we feel comfortable that it's safe. Let's look at it in detail. Here's what happens. It starts out the way you would think: a researcher identifies a data set, and because we have special landing zones for researchers, they can just drop the data set into the landing zone. It gets picked up by Airflow, which runs a DAG, a Directed Acyclic Graph, that does a bunch of different things we want done: it moves the data around and, most importantly, triggers a notebook that checks the data.
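A minimal sketch of what such an Airflow DAG might look like follows. The bucket name, connection IDs, cluster ID, and notebook path are hypothetical, and the exact import paths depend on your Airflow provider versions, so treat this as an outline rather than our production DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="researcher_adhoc_ingest",
    start_date=datetime(2020, 1, 1),
    schedule_interval="*/15 * * * *",  # poll the landing zone periodically
    catchup=False,
) as dag:
    # Wait for a researcher to drop files into their landing-zone prefix.
    wait_for_drop = S3KeySensor(
        task_id="wait_for_landing_zone_drop",
        bucket_name="research-landing-zone",      # hypothetical bucket
        bucket_key="incoming/*/*.csv",
        wildcard_match=True,
        aws_conn_id="aws_default",
    )

    # Trigger the Databricks notebook that profiles and validates the data set.
    validate = DatabricksSubmitRunOperator(
        task_id="run_validation_notebook",
        databricks_conn_id="databricks_default",
        existing_cluster_id="0000-000000-abcdefgh",  # hypothetical cluster ID
        notebook_task={
            "notebook_path": "/Shared/ingestion/validate_adhoc_dataset",
            "base_parameters": {"landing_prefix": "incoming/"},
        },
    )

    wait_for_drop >> validate
```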

We have a few of these notebooks and we're constantly elaborating on them. You can see a very simplistic example on the top right of the slide, but basically, we run through a bunch of statistics commands and figure out basic descriptive statistics about this particular data set. We can look at how many cells are filled, and so on and so forth. And then, using a dashboard, we can allow the researchers to confirm: yeah, that's what I thought the data set had, go ahead and put it into "raw" or "trusted," which are our two folders for data. It's kind of hard to see here on the top of the dashboard screen, but that confirmation step is there (a simplified sketch of such a validation notebook follows below). Okay, sounds good so far; maybe a little bit too much work for simply copying a set of files from one bucket to another. But here's the cool part: we can automatically create the data sources in Databricks so the data becomes available to other people, we can scope it by project so that it's available to all of the other researchers on the project, and it maps to the right Databricks cluster. And then finally, which I thought was pretty cool, we can use Immuta, which I'll talk about in a little more detail in a second, to create a protected data source, which will check the data for things like PII, PHI, and any other kind of information we don't want researchers to see. For more on that, let's go to the next slide.
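Here is a simplified sketch of the kind of validation notebook described above, written in PySpark as it might run in a Databricks notebook (where `spark`, `dbutils`, and `display` are provided by the environment). The path, widget name, and specific checks are hypothetical placeholders.

```python
from pyspark.sql import functions as F

# The landing-zone path is passed in as a notebook parameter (hypothetical default).
dbutils.widgets.text("landing_path", "s3://research-landing-zone/incoming/sample.csv")
path = dbutils.widgets.get("landing_path")

df = spark.read.option("header", "true").option("inferSchema", "true").csv(path)

# Basic descriptive statistics (count, mean, stddev, min, quartiles, max) per column.
display(df.summary())

# Fraction of missing values per column, so the researcher can confirm the data
# set looks the way they expected before it is promoted to the trusted zone.
null_fractions = df.agg(*[
    (F.sum(F.col(c).isNull().cast("int")) / F.count(F.lit(1))).alias(c)
    for c in df.columns
])
display(null_fractions)
```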

Immuta solves for P(rivacy)

What does Immuta do? Well, Immuta allows us to create policies and, essentially, personas and contexts that control how the data is accessed. Immuta tracks all of the data access, creates an audit trail for us, and provides access to researchers using a standard ODBC/JDBC-style setup. So far, so good. But there are two features that I really, really like, and I have examples of them below the text and in the upper right corner as well. One of them is called context. What Immuta lets us do is say, "Sure, your user is authenticated, but how are they actually using the data?" By creating these contexts, for example, when I connect in an interactive manner I can have the context of an explorer, whereas when I am running our server-side tests my persona is that of a model-testing environment.

The data made available through exactly the same queries, completely transparently to the researchers, can be different. We can do things like limit the number of rows or hold back a certain number of days of data. We can create a policy that basically says: if you access this data source as a researcher, you just don't get to see the last 90 days of data, no matter what you do. It doesn't matter what kind of Spark or Databricks query you write; you don't get to see the last 90 days of data. That is very, very powerful, both from an audit perspective and a comfort perspective, and frankly for not making stupid mistakes when you're trying to fit your data. It's very, very nice. Another thing that is really, really powerful is differential privacy. Differential privacy is a mathematical technique that basically jitters the data (I know there's a better word for it): it introduces a certain amount of randomness that does not affect the data's statistical properties but, in effect, anonymizes it by changing the values. Immuta, for example, supports a policy that says: when the data is accessed in an exploratory context, serve it using differential privacy, so that what the researcher sees isn't the actual data. It's close enough for their purposes; there will be roughly the right number of Mondays and Tuesdays, for example, but it's not actually the data that is in the database. It's essentially an on-the-fly synthetic data set that differs from the real data while protecting privacy. Immuta has many other capabilities, such as masking rows and masking data, but these two, combined with our ability to apply them on demand, mean that we can put an essentially off-the-street data source into Databricks with Immuta and feel comfortable that what researchers are seeing is not gonna include any PII, PHI, or any other harmful information. And by the way, when we're dealing with sensitive client data, we can obfuscate it in a way that makes no difference to a researcher looking through billions of rows of data, because it's statistically sound, but it isn't actually showing any client data to any of the researchers in case it gets lost or anything like that.
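To make the effect of those two policies concrete, here is a conceptual PySpark illustration of what they effectively do to a query result. This is not Immuta's API (Immuta applies such policies transparently in the query layer); the column names, the 90-day window, and the Laplace noise scale are hypothetical.

```python
from pyspark.sql import functions as F

def apply_exploration_policies(df, date_col="event_date", value_col="amount", noise_scale=10.0):
    """Illustrative only: mimic a 'hold back the last 90 days' policy and a
    differential-privacy-style perturbation of a sensitive numeric column."""
    # 1) Hold back the most recent 90 days, regardless of the query a researcher writes.
    cutoff = F.date_sub(F.current_date(), 90)
    held_back = df.filter(F.col(date_col) < cutoff)

    # 2) Add Laplace-style noise so aggregate statistics stay roughly intact
    #    while the raw values are no longer the true client values.
    u = F.rand() - F.lit(0.5)
    laplace_noise = -F.lit(noise_scale) * F.signum(u) * F.log(1 - 2 * F.abs(u))
    return held_back.withColumn(value_col, F.col(value_col) + laplace_noise)

# Example (hypothetical table):
# protected = apply_exploration_policies(spark.table("clients.transactions"))
```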

Obviously, we still work with clients on protecting this data and following all of their guidelines, but this is a really, really nice set of features for making us and our customers more comfortable with how we can protect their privacy, and the privacy of the people they represent, while still doing cutting-edge machine learning and data science on their data. So to summarize: Databricks gives us scale and speed, Immuta gives us privacy, and Databricks and Immuta together are a good chunk of what we offer our research team to work with.

Thank you. Thank you for joining us and I hope you learned something during this presentation.


 
About Slava Frid

WorldQuant Predictive

Slava is a 20-year+ technology industry veteran focused on solving business problems with software engineering. He has led award-winning teams in the finance, media, and non-profit sectors and enjoys pushing the envelope of what is considered possible. He is currently embracing the challenge of helping to build a new way to enable quantitative researchers to answer the toughest questions businesses and organizations have. He received his degree in Economics from the Wharton School and his Computer Science degree from the School of Engineering at the University of Pennsylvania.