Agile Machine Learning for a Dynamic World
Want to watch instead of read? Check out the video here.
Welcome to the first Databricks practical ML virtual event. I’m Ben Lorica and I will be your host for today. We have an exciting lineup for you today, including presenters from Databricks, as well as the Medical University of South Carolina, MUSC. In addition, we have also prepared a selection of some of the best talks from Spark + AI Summit so that you can watch them at your own pace following this event.
We hope you’ll really enjoy hearing about applications of machine learning in a variety of settings. We will kick things off shortly with an opening keynote from Clemens Mewald, followed by a product demo from Sean Owen. And we will cap off today’s event with an interview with Matt Turner, chief data officer at MUSC, and live Q&As to follow.
Our opening keynote will focus on building a modern data science and machine learning platform for real-time response, as well as overcoming small data challenges in machine learning. It will be delivered by Clemens Mewald, director of product management at Databricks. Clemens will be followed by Sean Owen, principal solutions architect at Databricks. Sean will be giving a demo on improving forecasting models during COVID-19.
Building a modern data science and machine learning platform
All right. Let’s discuss how to build a modern data science and machine learning platform for real-time response in these uncertain times. So I’d like to start with the drivers behind the latest AI trend. And it’s important because we’ll really see that some of the assumptions that we’re making in modern AI actually don’t apply in a very fast-changing environment. And we’ll talk about some of the challenges we see with small data.
Because the AI trend really has been driven by an ever-increasing amount of data in the last couple of decades. So the way to read this graph here is on the horizontal axis you see the amount of data as it increases and on the vertical axis you see the performance of the models as they consume more and more data. And really, a couple of decades ago, we found ourselves in a situation where there just wasn’t enough data to actually have deep learning systems outperform other approaches, where you would have to do manual feature engineering.
So this was really in the ’80s and ’90s, where deep learning was a theory, but it really didn’t apply well. But as through the last couple of decades the amount of data has been increasing, we’ve really seen this resurgence of deep learning and really see these deep learning models outperform all other approaches that we can see in the AI space. And of course, we expect this to continue increasing over time as the amount of data keeps increasing.
Of course, the amount of data is only one side of the coin. The other is that as more and more data becomes available, you need more compute power to process it. So the second dimension here is that the AI trend is also driven by an ever-increasing amount of compute power. The way to read this graph is that on the horizontal axis you see the number of FLOPS, in billions, spent to train each of these machine learning models, and on the vertical axis you see its accuracy on the problem at hand.
And of course, what you can see here is that as the models become more and more complex and the size of these models also increases over time, you need to spend more and more compute power to train them and they become more and more accurate over time. So taken together the availability of more and more data and the availability of more and more compute really has been pushing this trend towards more and more deep learning models that perform better and better with the more data and compute you actually throw at them.
And this trend really has served many industries and use cases as well. If you think about it, these days, almost all of our purchases happen online. So e-commerce companies really have very rich profiles of all of us and like what products we actually buy. Location data is also really pervasive and can be applied for a lot of different use cases. We’re actually seeing in these times of COVID-19 that location data collected from phones can also be used to actually assess whether social distancing is actually working.
And of course, as you may be able to guess, if you have hundreds of millions of mobile devices, this location data becomes very rich. In IoT, internet of things, there’s also more and more sensor data that is being collected. Of course, location data is IoT sensor data. But there’s many, many more use cases for this. One example is that more and more transportation companies, of course, send data back to be analyzed; or even in agriculture, a lot of the harvesting machines are also equipped with a lot of sensors.
And last but not least, of course, many, many financial transactions have been made online and digitally over time, but you actually see this amount of data increasing as well. So as you may be able to guess, each one of these areas and many, many more are producing more and more data. And this taken together with more compute power really has helped AI and deep learning become more and more popular in the last couple of decades. However, what happens if none of that data actually makes any sense anymore?
So here you can see a graph of LA Metro system-wide ridership, covering both bus and rail in LA. Over the last two years, monthly ridership has hovered around 1.3 million. There may be a slight downward trend over time, but it's pretty stable. So if you had asked me in February 2020 what I thought this would look like going forward, I probably would have drawn a trendline like this, because it's pretty safe to assume the trend will continue.
And you can probably already see where I'm going with this: because of the COVID-19 crisis, things played out very differently. This is the actual data as of July 2020; as of recording this talk, we don't have data for August yet. But if you asked me in July 2020 what I thought would happen next, I would have a much, much harder time predicting it.
You could assume that it's just going to continue on the new trend we've established. Maybe it will resurge back up and then continue the old trend, or maybe it will come back down again. I think you get the idea. When these types of events happen, it's really hard to rely on historical data to make any future predictions. And this is only one category of common small data challenges; I'm going to talk about three.
The first one is what I just mentioned, is really disruptive discontinuous events. Then there’s cold-start problems and also hyper-personalization. So let’s dig into each one of these. So disruptive and discontinuous events can happen for all kinds of different reasons. Another example is actually products going viral. So if you have some product where for some reason the demand really, really picks up from one day to the next, often it’s very hard to predict demand in the future.
There may be competitive and regulatory changes in your market that are also disruptive to the data you've collected so far. And then, of course, there's the category of natural disasters and pandemics that really make it hard to rely on any historical data. The cold-start problem is where you don't have any data available for the problem you're trying to solve, not even outdated historical data.
And of course that can happen for several reasons. One of them is you have an entirely new product or a new company and you just don’t have any data collected at this point. The other category, which is also very common, is it’s very costly to acquire labeled data. And this can happen in areas such as medical imaging. If you imagine, every time you need a pathology image labeled, you actually need a doctor sitting down and spending a lot of time on it. And of course, doctor’s time is very valuable.
So this is another case where you really have a problem where there’s not a lot of data available for you to train these models. And third is what I call hyper-personalization. So this could be the case when maybe you do have a large number of users overall, so maybe you have a product that has 10 million users, but the data for each single user is very, very small. And that may be relevant because if there’s regulatory or privacy related reasons why you can’t actually use the data of all of your users together, then you find yourself in a situation where each one of these users has a very small amount of data. And that also needs to be addressed.
So let’s look at some of the solutions to these small data challenges, specifically how Databricks addresses them. So first, for the disruptive and discontinuous events, there’s a couple of ways to approach these challenges. The first one is actually data augmentation. So you can actually augment your existing data with other data that helps you make predictions in the future. Secondly, once you have that data augmented, it’s extremely important to actually automate and increase the frequency with which you train these machine learning models.
Because if it takes a month for you to train a machine learning model, it’s probably already outdated in uncertain times like today. And last but not least, actually simplifying the consumption of these models is also important because the deployment life cycle often takes way too much time. So this is a simplified overview of Databricks as a data science and machine learning platform. And you see all of the different components that you would expect on a platform to go from ingesting all kinds of different data to training machine learning models and deploying them, and also managing the end-to-end machine learning lifecycle with MLflow.
And as I'm going through the solutions to these small data challenges, I'm going to highlight different areas of this diagram. First, let's talk about data ingestion with Delta Lake and Spark. If you find yourself needing to augment your data, it's extremely important to increase the diversity, fidelity and frequency of the data you ingest. Databricks introduced a feature called Databricks Ingest that provides a partner ecosystem to ingest data from all kinds of different databases, business applications, and general file and storage systems.
That allows you to pull in all of the relevant data from those systems with very low latency, so that the data is available for your predictive models. Once those systems are connected, it's extremely important to make sure the data is ingested in a timely manner. So we also introduced a feature called Auto Loader, which streams data into a Delta Lake table as it arrives in blob storage systems such as S3 or ADLS.
So for data scientists, it couldn’t be any easier. You really just need to drop the new files into like a directory in S3 or ADLS and Databricks automatically detects them and streams those files into a Delta Lake table to make sure that it’s immediately available for your downstream use cases. Now, once your augmented data is available, now we move on to the machine learning training part. The machine learning runtime on Databricks gives you a turnkey environment to actually train and tune these machine learning models to make sure you have the highest quality predictive models available to you.
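As a rough sketch of what that looks like in code (the paths and table name here are hypothetical, and `spark` is the session Databricks provides in a notebook):

```python
# Hypothetical sketch of Auto Loader: stream files from a cloud drop zone
# into a Delta table as they arrive. Paths and names are placeholders.
stream = (spark.readStream
    .format("cloudFiles")                          # Auto Loader source
    .option("cloudFiles.format", "csv")            # format of incoming files
    .load("s3://acme-landing/sales/"))             # directory to watch

(stream.writeStream
    .format("delta")                               # sink: a Delta Lake table
    .option("checkpointLocation", "/tmp/sales_checkpoint")
    .trigger(once=True)                            # or run continuously
    .start("/delta/sales_bronze"))
```

The checkpoint location is what lets the stream resume where it left off, so each dropped file is processed exactly once.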
Now, as mentioned, in the case of disruptive and discontinuous events, it's extremely important to increase the frequency of training these models and to automate the process. With the machine learning runtime, you can automate the process by scheduling these training runs as jobs, to make sure you always have the latest machine learning model available rather than relying on a manual process. And once these models are trained, you need to make sure the deployment lifecycle is as easy as possible, with many easy ways of deploying these models.
So with the MLflow Model Registry and a new feature on Databricks called Model Serving, we simplify the management of the model lifecycle. With the registry, you can easily hand off models from data scientists to deployment engineers, instead of a manual process that takes forever. And with Model Serving, you have a turnkey solution where, with two easy clicks, you can expose these models as REST endpoints so they can be consumed in reports or downstream applications.
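Here's a hedged sketch of what that handoff looks like with the MLflow client API (the model name and version number are made up for illustration):

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Hand a newly trained model version off to deployment by promoting it.
# "demand_forecaster" and version 3 are placeholders for illustration.
client.transition_model_version_stage(
    name="demand_forecaster",
    version=3,
    stage="Production",   # stages: None, Staging, Production, Archived
)

# Downstream consumers load "whatever is currently in Production" by stage,
# so promoting a new version requires no code changes on their side.
model = mlflow.pyfunc.load_model("models:/demand_forecaster/Production")
```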
And this really streamlines that process of taking a model that was trained, let’s say every hour, to be as timely as possible and deploying it for consumption downstream. In many other cases, if that deployment lifecycle takes a month, you already have outdated information. So coming to the second category, which is the cold-start problem. And the solution here is really a method called transfer learning. “What is transfer learning?” you ask.
Well, transfer learning is a technique where you can train a machine learning model on a set of data that may be completely unrelated to your actual problem and then transfer the learning from that model to your specific domain. So in this example here, you can see, this is taken from a paper, a model that was trained on ImageNet. And you can see some of the example pictures here, like there’s coffee beans and there’s a flower. And then that model is then fine-tuned with pathology images for medical imaging.
And it's really surprising, but it turns out that models trained in such a way, trained on a completely unrelated set of data and then fine-tuned on your domain-specific data, outperform models trained only on your specific data by a significant margin. What you can see on the right, in this specific example, is the AUC, or Area Under the Curve, a machine learning quality metric where higher is better. The model trained with transfer learning, meaning it was initially trained on ImageNet data and then fine-tuned with pathology images, significantly outperforms a model trained only on pathology images.
And to many of you, this may be very surprising, but it turns out that this method of transfer learning performs this really, really well. And it generalizes also beyond just images. So how do we solve this on Databricks? As I mentioned before, MLflow has a component called the model registry and the model registry is really the central hub for you to exchange models with your teams and manage the deployment lifecycle. So let’s say someone on your team trained a merchandise detector. So like an object detection model that detects merchandise and images.
Now, another member of your team may find this model in the registry and transfer the knowledge from it to a model that detects people in images. And this is one of those examples where I can almost guarantee that the people detector will perform better using transfer learning from a different object detection model than if you started from scratch with only the images from your people detection problem.
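A minimal sketch of that fine-tuning pattern, assuming a Keras/TensorFlow setup and a hypothetical binary people-detection task (none of these names come from the talk):

```python
import tensorflow as tf

# Hypothetical transfer-learning sketch: start from an ImageNet-pretrained
# backbone, freeze it, and train only a new task-specific head.
base = tf.keras.applications.ResNet50(
    weights="imagenet",        # knowledge transferred from ImageNet
    include_top=False,         # drop the original 1000-class head
    input_shape=(224, 224, 3),
)
base.trainable = False         # freeze pretrained features for the first pass

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # person / no person
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])      # the quality metric above
# model.fit(small_domain_dataset, epochs=5)  # fine-tune on your own images
```

Unfreezing some of the backbone's top layers for a second, lower-learning-rate pass is a common follow-up once the new head has converged.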
And last but not least, let's talk about hyper-personalization real quick. The way I like to frame this is as a big small data problem, and the solution is really massively scaled machine learning ops. Let me explain. In hyper-personalization, and this is an actual use case at Databricks, you may have device data, an IoT use case where, for each device, you have some data such as sensor readings.
Now, you can train a machine learning model for each one of these devices, usually just by applying it to a groupby of a DataFrame. In this case you can see SQL code, but of course you can also use the Python APIs. We basically say this train-forecast code is applied to each device ID, and then use the machine learning runtime to parallelize the training of a specific model per device ID. So if we have, say, three devices, we train three models in parallel, and all of these models are stored in the MLflow Model Registry.
And then of course, when you perform prediction, the forecasting code can just pull these models out of the model registry: for each specific device ID, pull the appropriate model, compute the predictions and write them out to a Delta table again. This is a very common pattern that we see with our customers, and it scales to unbelievable numbers of models. We actually have a customer on Databricks with 2.6 million machine learning models trained this way.
And if you really need a specific model for each device ID or for each user or like for each other identity that you’re training these models for, this is an extremely scalable way of actually training a specific model for a specific device or person or other entity and actually managing that complexity with the MLflow model registry.
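The grouped-training pattern described above can be sketched with Spark's grouped Pandas API; the table and column names are hypothetical, the per-device model fitting is elided, and `spark` is the Databricks-provided session:

```python
import pandas as pd

# Hypothetical sketch of the "big small data" pattern: one model per device,
# trained in parallel by applying a Pandas function to each group.
def train_forecast(history: pd.DataFrame) -> pd.DataFrame:
    device_id = history["device_id"].iloc[0]
    # ... fit any per-device model on `history` here, and log it to the
    # MLflow Model Registry keyed by device_id ...
    return pd.DataFrame({"device_id": [device_id], "status": ["trained"]})

results = (spark.table("device_readings")
           .groupBy("device_id")          # one group, and one model, per device
           .applyInPandas(train_forecast,
                          schema="device_id string, status string"))
```

Each group is handed to `train_forecast` as an ordinary Pandas DataFrame on an executor, which is what lets this scale to millions of independent models.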
So in summary, the Databricks end-to-end data science and machine learning platform provides you with all of the components to address the common reasons you would find yourself with small data and actually react quickly to these changes in your environment. And in the demo that’s following my talk, we’ll actually show you how you can solve for the disruptive and discontinuous event case on Databricks.
DEMO: Forecasting and improving predictions in the time of COVID-19
Thank you. This is Sean Owen. I’m a principal solutions architect here at Databricks and I focus on data science. I’d like to spend about half an hour giving you a more specific example demo of the themes of today. Of course, that includes forecasting and improving predictions in the time of COVID-19, which has, of course, changed a lot about the world.
And the theme here is not going to be building the best prediction we possibly can, or the fanciest one, but building something quickly with off-the-shelf tools, getting from maybe nothing or something primitive to something much better quickly and using the tools that you’ll see in Databricks. The problem is making predictions is tough, especially about the future, as Yogi Berra, a baseball great once said. And that’s all the more true today, I think it’s safe to say, right?
Predicting the future is hard even in good times. We may wonder, is this customer going to churn? And build a model for that, or try to predict the value of building a new feature in our product, or, of course, predict things about our business, like what sales are going to look like next month, by state, by city, by country. And that's always been hard, of course. But 2020 is anything but good times, and that is, of course, because of COVID-19.
It’s affected people’s lives, of course, in terrible ways, but it’s also simply changed a lot about how we live and therefore how we shop and it’s changed some rules and regulations, and that all has impacts on a business. And it’s not just that these changes came rapidly, it’s not just that they were significant in size and scope, it’s that they’re still uncertain going forward.
Here’s a snapshot from the IHME model, which helps predict the progress of COVID-19 into the future. And as you can see, it’s still not clear what’s going to happen with it going forward. All that means that making predictions about the future is even harder. But we still need good predictions. We need good forecasts. We probably need better forecasts than ever before in the face of this uncertainty.
Now, that's been hard to date, because sometimes it means finding new tools that are complicated, or hiring particular forecasting expertise that's just hard to come by. Better forecasts are hard to make, but today, more than ever, we need them, and we need lower-latency forecasts too. Maybe in the past it was sufficient to forecast a month out. Now we may need those forecasts on the scale of days out, and we need them to be pretty accurate.
Of course it would be nice to go back in time and have collected a lot more data about our business than we were collecting five years ago, three years ago. We can't do that now. But maybe we can use alternative data, publicly accessible data that is nevertheless relevant to our business's forecasting problem. Of course, we'd love to spend six months building out a new data science team and polishing a new forecasting system and product, but we don't have those six months.
The world’s changing fast right now so we need something maybe better than we had last year and we need it right now, more or less. And so we need to prioritize speed-to-market. So in this example, we’re going to try to do these three things. Number one, use better forecasting ideas. We’re going to need something that can adapt readily to changes in trends, or really changes in the underlying reality of the world, of the business and how customers behave.
We prefer something that works out of the box and requires little tuning. And we’re going to therefore pick a tool called Prophet, which I’ll describe later, that ticks those two boxes. Now, of course, the major source of uncertainty in business these days is COVID-19 and its effects on rules and regulations and also consumer behavior. So we’re going to grab some public forecasts from the IHME model.
The IHME model not only provides data on the progress of COVID-19 to date in terms of things like infections and deaths, but provides projections for those values ahead, by state, by city as well. And we need a tool that can use that alternative data, and it will turn out indeed that Prophet can do that. Last, we’re going to prioritize speed-to-market. We want to get an MVP release, a minimum viable product release for this new forecasting system. Not necessarily the best, fanciest solution we can imagine.
And so for that, we’re going to want to choose off-the-shelf open tools that work well with the mainstream ecosystem. That also describes Prophet, but it also describes tools like MLflow, which I’ll show you how to use here as well. And we’ll see that with a little effort we can actually buy ourselves quite a bit of accuracy and sophistication in a system like this.
Now, I don't have a small business of my own with actual sales data; that tends to be private. So we're going to make up sales data for a fictional business called Acme Corporation, which I think some of you are familiar with. Let's just say they're a hardware retailer present in all 50 US states. We don't have retail sales data for this fictional company, of course, but we can grab mean US retail sales data and use it to generate some fairly realistic-looking sales revenue data for this company.
So to start, we’ve done that. So let’s say we have revenue for this company. And this company, Acme, does have a forecasting system. It’s just very primitive. Their forecasting system simply assumes that sales grow by 1% per month per state. And that was an okay assumption in the past. For the past 50 years, their business has been quite stable, but it’s no longer stable in 2020 and it’s affecting their ability to forecast where to put production, where to put their sales effort into. So we’re going to try to improve Acme’s forecasting system.
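Acme's legacy rule is easy to state in a few lines of Python; this is a hypothetical reconstruction of that 1%-per-month baseline, not their actual code:

```python
def naive_forecast(last_revenue: float, months_ahead: int,
                   monthly_growth: float = 0.01) -> float:
    """Acme's legacy rule: revenue compounds by a fixed 1% per month, per state."""
    return last_revenue * (1.0 + monthly_growth) ** months_ahead

# Forecast three months ahead from $1,000,000 of monthly revenue.
forecast = naive_forecast(1_000_000, 3)   # about $1,030,301
```

A rule like this encodes one fixed trend and nothing else, which is exactly why it breaks the moment the underlying reality shifts.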
Let's get to it. As promised, Acme is a little bit modern: they have a forecasting system, it's just a bit of Python code, and maybe it lives on somebody's laptop somewhere. That may be familiar for an organization or a division that hasn't put a lot of energy into forecasting so far; maybe there's some legacy system like this. This one's very simplistic, of course, and it runs locally.
This is generating the fake data here and then simply making forecasts using Pandas according to this simplistic rule of 1% growth per month. But that’s where we started. So maybe we want to use this as a jumping off point. But we want to go ahead and put this into a better environment and apply better tools to the same data that this is trained to forecast from.
So we hop over to Databricks. First thing we can do is create a new project based on this code. Let me go grab the URL. So we can create projects based on existing repositories of code like this. And as it happens, we can happily just execute this in Databricks. Let’s pick a cluster that’s available here and just run it. Now, the nice thing about Databricks is it can run code not just that uses Spark, but that uses common tools like Pandas as here and can read data in the cloud. No problem.
And this is because this is all part of the runtime that’s included with Databricks. So indeed this runs quickly. And we can see, for example, that it prints the revenue data and then prints the forecast by state. And you can see off the bat that the forecasts do not match revenue very well. For Alaska, for example, in April, May, June, the errors are 30%, 20%. And that just sounds far too high to be of use.
So this is where we start. And maybe we go into the workshop and spend some time developing a better solution in a project here and we may end up with something like this. I’m going to switch to this notebook here. So let’s fast forward a week or so. And this is where we’ve gotten after playing with this with some simple open tools for about a week.
First thing we’ve done here is modify the code a little bit to read the data with Spark because maybe that’ll be useful later. We didn’t have to do that. The display is a little nicer. Sure. But we can do things like not just print the errors in the predictions here, but plot them too with a nice built-in plotting. And this makes it quite clear that, for example, for Arizona, the actual revenue in March, April, May, June is not at all well predicted by the forecast, the orange line here. It just lags by, well, a month, pretty much by design. And it’s well off over time.
And this just won't do anymore. The errors are too large to operate a business on forecasts like this. So what can we do? Well, as promised, first we're going to try better tools. The nice thing is the open-source world has a bunch of powerful, effective tools even for things like time series forecasting. The one we're going to pick and look at today is Prophet, from Facebook.
It's an open-source project with a couple of nice properties. Number one, it can adapt to changing trends, meaning changes in the underlying reality generating revenue and customer activity. It also deals well with seasonality out of the box. That's good, because this is sales data, and sales data is inevitably seasonal, including accounting for holidays, in this case US holidays.
And it can also give us uncertainty estimates and confidence intervals. It does all that out of the box without much tuning. So let's see how that works. We're going to load the data for Arizona; let's just focus on their main market, which is Arizona. We grab this as a Pandas DataFrame, and this is really all it takes to use Prophet.
Instantiate it, have it fit a model to this time series, this revenue time series. And Prophet also integrates nicely with things like Plotly. This is a plotting library that generates these nice interactive plots like this. So just a little bit of code like this, you can get pretty nice forecasts. And this, I think, obviously looks a lot better than where we started for Arizona.
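A minimal Prophet sketch along those lines (the `revenue_df` frame and its column names are hypothetical; at the time of this talk the package was published as `fbprophet`):

```python
from prophet import Prophet  # published as `fbprophet` at the time of this talk

# Hypothetical revenue frame: Prophet wants columns `ds` (date) and `y` (value).
history = (revenue_df[revenue_df["state"] == "AZ"]
           .rename(columns={"date": "ds", "revenue": "y"})[["ds", "y"]])

m = Prophet(interval_width=0.80)              # 80% confidence interval, as plotted
m.fit(history)                                # fit trend, seasonality, holidays

future = m.make_future_dataframe(periods=90)  # extend 90 days past the history
forecast = m.predict(future)                  # yhat, yhat_lower, yhat_upper columns
```

The `forecast` frame covers both the historical dates (the retroactive fit) and the 90 projected days, which is exactly what the blue line and shaded band show.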
The black dots here are actuals, and this is actual sales data by day for Arizona. And you can see, of course, it’s dropped as the pandemic kind of set in initially in March and April, recovered a bit, and yet is still declining again. So Prophet’s produced a prediction, the blue line that seems to track this retroactively pretty well and predicts, generally speaking, a downward trend going forward.
You can see it accounted for things like holidays; this is the effect of July 4th. You can also see it provides a confidence interval, an 80% confidence interval here. And it helps us understand that although the blue line seems like a reasonable central estimate, the forecast gets quite uncertain as we go two or three months out. That seems natural: there's been a lot of change here, and there might be more change to come. But that's really not bad.
So we might wonder, though, how good is it? Eyeballing it, it looks okay, but how accurate is it? Well, again, Prophet provides some nice tooling for cross-validation. This lets us, for example, evaluate how accurate the model is, via mean absolute percentage error, as we try to predict from two days out up through, say, 14 days out. And of course, as we predict further out, the error increases. But we're doing pretty well.
It starts at maybe about 4%, going up to 8% two weeks out. Clearly we can make predictions even a day out with fairly small error, not 20% or 30%, but more like 4%; even two weeks out, it's only about 8%. So this is already pretty good. The predictions Prophet is generating seem far superior to the simplistic ones we generated before.
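Prophet's `cross_validation` and `performance_metrics` utilities handle the backtesting; the headline metric they report here, mean absolute percentage error, is simple enough to sketch on its own (a hypothetical standalone helper, not Prophet's implementation):

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error: average relative miss across forecasts."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)))

# Two forecasts, each off by 4% of the actual value:
error = mape([100.0, 200.0], [104.0, 192.0])   # 0.04
```

A MAPE of 0.04 is the "about 4%" figure quoted above: on average, each forecast misses the actual by 4% of its value.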
Now, we could declare victory and go home, but there are some more things we can do without much trouble that would make this process more robust and maybe make the models a little bit better. So let’s keep going. Number one. One thing we can do, of course, is log this model with MLflow. MLflow is an open-source framework for tracking models and experiments, and also helping deploy them to production.
It’s easy enough to log the model and its error metric. And we get a nice display like this, for example. This records in an organized way who made the model, when, how long it took, what exact revision of the code was used while building this model. And of course, we can log not just the model, but nice artifacts here like that plot we just saw.
This makes it easy for people to see the output of a data science process, review it historically, and track it in an organized way. We can also go a little further and go ahead and register a model in the model registry here for Arizona and add this run as a new version of that registered model. Then we get something like this. So you can see we’ve been working on Arizona, let’s say in the lab here. And we built a couple of different previous models we thought were good.
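A hedged sketch of that tracking-and-registering flow (the run name, metric value, artifact and model names are all placeholders):

```python
import mlflow

# Hypothetical sketch: track the Arizona run, then register it as a new
# version of the per-state model. All names and values are illustrative.
with mlflow.start_run(run_name="prophet_AZ") as run:
    mlflow.log_param("state", "AZ")
    mlflow.log_metric("mape_14d", 0.08)      # error from cross-validation
    mlflow.log_artifact("forecast_AZ.png")   # the forecast plot we just saw
    # ... log the fitted Prophet model itself here, e.g. as a pyfunc model ...

# Registering the run's model creates the next version (e.g. version 3)
# of the named registered model.
result = mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="revenue_forecast_AZ",
)
```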
This new one we just built, this is now registered as version three and it’s the current production model. And that’s nice because maybe another process needs to take over soon and always ask or know what the latest version of the model is for Arizona. And soon we’ll see, of course, all 50 states. Now, as advertised, maybe we can do a little bit better if we add other data.
We don't have any other internal sales data, but we can grab public data that's relevant to the problem, and in this case the problem is COVID-19. There are a number of public models out there that not only provide datasets telling us about the progress of key metrics like infections and deaths to date, but also make projections about how those values are going to change over time.
The IHME model is one of the more well-known ones. It's not the only one, of course, but it's the one we're going to try here. And adding it is not that hard. We can go to their website and download the data; within it, dated August 21st, is a CSV file, and we can simply copy it to distributed storage, read it in, and plot some values.
So here we’ve extracted values like infections and deaths and plotted them here for Arizona. This is deaths. And you can see they’re a bit spiky because of weekly seasonality. But the predictions take over right about here, on August 20th or so. And the IHME model predicts that the level of deaths stays flat. That’s a value that’s pretty consistent with actuals, but it’s going to increase over time and maybe reverse later in the year.
So this might be useful in that we might be able to relate the progress of COVID state by state and explain some of it with this alternative data. And if we can explain the changes by state in terms of the COVID metrics, then of course we can make better predictions by using the predicted future values of those metrics in our own forecast.
So I guess you could argue this is also a form of transfer learning. We are using the output of one model, the IHME model, to improve the forecast of another model, our profit model. Good news is that it’s not hard to do at all. So we can take the data, again, for Arizona. We can add to it the values from the IHME model. Here, I’m just going to try infections. We could also add deaths and see what that does to the predictions, the forecast.
Same story here. We get another, slightly different set of forecasts out for Arizona, and visually it does not look that different. It was actually pretty good to start, but it’s slightly different. And maybe we might think it tracks a little bit better, but we need to quantify that, of course. So again, we can evaluate this new set of predictions for Arizona. These are the originals without the IHME data, and these are the new error metrics with the IHME data. And they’re pretty similar.
I think you can see that it’s a little bit more accurate, maybe 1% overall. Certainly if you look at the 14 days out, it’s doing a little bit better with this new data set across the board. Not a massive effect, but a 1% increase in accuracy could be quite meaningful to a business. And certainly for having simply thrown some data into the model with a couple of lines of code, that’s not bad. That’s a win I think we’ll take.
But I think we can do better. And we can do better in a couple ways. Number one, we don’t want to make forecasts just for Arizona. Of course, we want to make forecasts for all 50 states. And maybe we want to try to tune this model a little better. Prophet really doesn’t need much tuning, but you can tune it. And so maybe we can throw some easy solutions at that to auto tune, do a little auto ML on Prophet itself.
So neither of these tools we’ve been using directly use Apache Spark, but it doesn’t mean Apache Spark doesn’t have a role here. Of course, producing predictions for all 50 states in this setup, those are independent problems. We can do those in parallel, and Spark’s good at doing things in parallel. So what we can do is leverage Spark to manage the data movement, to group data by state, for example, and apply our forecasting logic.
And you reuse the same code we’ve been using so far that didn’t actually use Spark itself. And in that way, we get the best of both worlds. We can leverage all these open-source tools. We can use existing code but still use Spark to gain parallelism in the scale. So that’s what we’re going to do here. We’re going to get all of the sales data and we’re joining that with the IHME predictions here. And then we do this.
It may look a little complex, but this is, I assure you, the most complex this is going to get. To explain what this is doing, let’s actually start from the bottom. Number one, we’re using Spark to group all this data by state and not just group it, but present it as Pandas data frames to Spark tasks, to the code we’re going to write here that does something for each state.
That’s nice because we were writing code in terms of Pandas before, and now we can reuse that code as a result. So Spark is handling all the execution of this logic for all 50 states and all the data movement; we can simply write the logic we need to apply for each state. And the logic is pretty similar. We’re going to do the same thing with Prophet. We’re going to build a forecast. We’re going to evaluate it.
The one new element is this: this is using a tool called Hyperopt to tune Prophet. Now, Prophet does not have many tuning knobs. It has a couple, and we might try tuning one of the important ones called the changepoint prior scale. The meaning doesn’t matter here. Suffice it to say, it ranges over small positive values. And we don’t know which one’s better. It might be different for each state.
So we’re going to let Hyperopt do this for us and try a number of different values intelligently and just pick the best one for us. And having done that, we can get the best forecast, the best model that was created, and evaluate it and log it as we did above with MLflow, the model itself, its error, and the plot, too, for the state. And if we do that, happily, we get something like this.
We get a run for every single state here. This is nice, including its error metrics and the best parameter we chose, which is helpful. Even this view itself is nice for clicking through and seeing what the result was for each individual state. We can, of course, go back and log all of those 50 runs as 50 updates to 50 registered models for the 50 states.
And if we do that, we, again, get not just a model for Arizona, but models for other states too, all 50 states here. So we just took a little bit of extra code here. We put a fair bit of process around creating models for states and getting them to production, for example. Now, having done that homework, the rest becomes easier. We can do more stuff. This has bought us a few things, some nice little things like, for example, having logged this with MLflow.
You can load all that data from MLflow, all this information, as a Spark data frame. Given that, we can turn loose some of the built-in plotting tools in Databricks on that data. For example, maybe we want to see where the forecast error is high and low across these 50 states. That’s what we’ve done here. We are reading the MLflow data and plotting error by state using a US map.
And you can see, for example, that for some reason, North Dakota’s error is a little bit higher. But Arizona is doing pretty well after this tuning. That’s nice. But I think maybe more interesting is the fact that we can now write a robust production process here to generate all these forecasts. So we have generated the forecast in the sense that we’ve generated plots and so on, but now maybe we want to come back and generate those actual numbers as data in a data frame so we can go do something with them, save them, publish them, maybe make additional plots from them and so on.
So that’s what we’re doing here. And again, same idea. We are going to use Spark to group the data by state as Pandas data frames. And then we can simply use logic in terms of Pandas data frames, using the same sorts of code we have before to make predictions and return those. And this is nice because Prophet itself, again, works in terms of Pandas data frames.
So this integration here with Spark makes it possible to use these tools pretty much directly with something like Spark. That’s what we do here. We’re simply going to load the latest production model per state and make predictions for that state and return them. Simple enough. And if we do that, we get a bunch of data here. We get predictions for all 50 states for most of the rest of the year. And we can do a number of things with that.
For example, we could choose to aggregate the revenue, all these projections, roll them up and sum them across the whole country and see what that adds up to over time. So actuals proceed here through about August 16th and here are really the sum of the predictions for all the 50 states going forward. And this is bad news, I suppose, for Acme, but it’s good to know.
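That national roll-up is a simple group-and-sum over the per-state predictions; `ds` and `yhat` follow Prophet's output column names, and the values here are stand-ins:

```python
import pandas as pd

# Stand-in for the combined per-state forecasts.
preds = pd.DataFrame({
    "ds": pd.to_datetime(["2020-09-01", "2020-09-01",
                          "2020-09-02", "2020-09-02"]),
    "state": ["AZ", "WY", "AZ", "WY"],
    "yhat": [120.0, 30.0, 118.0, 31.0],
})

# Sum across states per date to get a country-wide projection.
national = preds.groupby("ds", as_index=False)["yhat"].sum()
print(national["yhat"].tolist())  # → [150.0, 149.0]
```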
We can also maybe turn around and apply tools like Plotly again to make customized plots here. For example, maybe we want to overlay several of the projections for key markets here in the Southwestern United States. And we can do that too here, like so. And there are some funny discontinuities here that are, I think, really due to some jumps in the IHME models predictions. But I believe you get the idea here.
We have actuals and then we get the predictions going forward and we can see how maybe different states are behaving differently. Some seem to be not that affected like, I guess, Wyoming here. Some are in a bit of trouble like our own Arizona. Now, last but not least, we may want to share this output with business users and we can do that easily in Databricks. We can export these as dashboards.
So we can create a dashboard with the output of these last two cells and share just that. It’s still interactive. It still would update if we updated these. But this avoids having to share an entire complicated notebook with someone that just wants to see the plots, or maybe you want to put this up on the wall or something like that. That’s easy enough, too.
So in closing, what did we do here? Well, we started with a bad situation, a changing world, an uncertain world, and primitive tools with which to forecast our business. And in just about 40 cells of notebook here, maybe a week’s work, we applied modern off-the-shelf forecasting tools like Prophet to make much better predictions. We also managed to add alternative data sources like, for example, the IHME model’s predictions, and that actually improved the predictions, with just a few lines of code, by maybe 1% or so.
And then we took that and really productionized it. We not only scaled it up with Spark to generate forecasts for all 50 states, but we automated it. So this can be run every day, every hour, if you like, and in an organized way. The experiments are tracked and managed with MLflow. And we showed how you can get to a production forecasting system, again, one that’s automated and scales, and that also presents results that are relevant to business users, like these nice interactive plots. I hope that was compelling. Thank you.
How the Medical University of South Carolina scaled their machine learning practice
Thank you, Clemens and Sean. You both gave great tips and examples of how companies can move more quickly to scale, manage and augment their machine learning models. Now, let’s hear from Matt Turner, chief data officer of the Medical University of South Carolina, an integrated academic health sciences center. I’ll be speaking to Matt about how they scaled their machine learning practice and how the onset of COVID-19 affects the work they do.
All right. So I’m here with Matt Turner, chief data officer at Medical University of South Carolina, which I will refer to, Matt, as MUSC, if you don’t mind, for the rest of our conversation. So, first of all, Matt, before we dive into the details at a high level, describe the role and usage of machine learning at MUSC.
Absolutely, Ben. Thanks for having me. So yeah, Matt Turner, chief data officer here at MUSC in beautiful Charleston, South Carolina, and I really oversee our entire data practice. So everything from data management, data engineering, building our modern data ecosystem to building our modern AI workbench and all of our AI teams as well. So we’ve been on a bit of a journey here at MUSC, infusing machine learning into our advanced analytics platform and then working on a variety of problems.
So we’ve had use cases here from the financial realm, doing a lot of forecasting lately that we can talk about, into the deep clinical realm. So early identification of patients who are at risk of dying from sepsis, predicting which patients are going to get readmitted to the hospital, looking at who’s going to be a length-of-stay outlier, a number of deep applications, and even more recently predicting COVID-19 positivity. So a lot of various problems, but we’ve really tried to build it as a core capability as part of our ecosystem.
So as a chief data officer, you have a broader mandate, I expect that spans beyond machine learning. So your perspective is probably a lot broader than someone who’s just deep into building models and things like this. So to what extent, Matt, does that affect how you think of ML? So in terms of the types of tools that are needed to have a sustainable and repeatable ML practice?
Yeah. I think for us, it’s all about generating insights. Our sort of mission statement internal to our information solutions group is harnessing the power of information to improve the lives we touch. As the premier academic health center here in South Carolina, it’s our job to take care of all of the people across the state of South Carolina. And so for us, the best way to do that is take all of this data we have, treat it like a strategic asset and generate insights from it.
And machine learning is absolutely core to us being able to do that successfully at scale. I mean, so many of these problems and opportunities can be enriched by having a rich data set, but we only have so many hours in the day and so many people who can take these challenges on. So we’re always looking for ways to extract that next level of intelligence to drive the processes and the outcomes that we’re looking for.
So as the chief data officer, what sorts of foundational tools did you put in place in order to have a machine learning practice that’s strong and sustainable and repeatable?
Yeah. I think as a sort of dyed-in-the-wool IT executive, you’ll hear us talk a lot about people, process, and technology all the time. That’s the trinity of IT management, I guess you would say. But for me, it really does start with the people. Without a good data science team, you can’t do this type of work. So we’ve invested in bringing in a data science team.
A gentleman named Matt Davis leads our practice. He’s our lead data scientist. He’s recruited a young pipeline of up-and-coming data scientists. We’ve built relationships here with numerous organizations: Charleston, right here in our backyard, and Clemson, which has a very strong engineering program. So we’re building those talent pipelines and relationships.
At any given time, we’ll have four data science interns who are working in the cloud building machine learning. Sometimes they get a little bit of whiplash, because they come in expecting a cursory internship and find themselves working on real problems. So we’re always looking to scale that workforce and get as many cycles as we can. For us, we look at machine learning as an opportunity to create many, many models and really hone in on that top insight.
So putting the team together was number one, and we’ve spent a lot of time doing that. Secondly, I would say it is the technology platform. And so for us, that has been a heavy investment in what we call our modern data ecosystem. When you hear me talk about that, I’m talking about our full data estate for MUSC. We spent the last two years building out a platform on Microsoft Azure. Databricks is certainly a major component of that.
And building an end-to-end data solution for our entire organization, everything from modern business intelligence, analytics, dashboards, data warehousing, all of that is included in this architecture. So that platform has a component called the AI workbench. The AI workbench is a process where we’ve taken the core elements of healthcare. So think lab results, medications, surgeries, all these types of information, and in many cases, the physicians’ notes themselves.
And we’ve brought that into our enterprise data lake and then we’ve enriched and refined that data in such a way that we can apply machine learning to all of those signals together. So in our case, we’ve got seven years of history, at least in the cloud, for most of these elements. And it gives us a very rich foundation for which to do feature engineering. And I’m sure we can talk much more about that.
And so to what extent did you folks invest in a feature store, for example, or features that are specific to the types of problems that you guys have to tackle day in and day out?
We’ve spent a ton of time on this. For us, good feature enrichment and selection is, if there’s any secret sauce to what we’re doing here at MUSC, that’s absolutely it. Most organizations are using similar technology platforms. We sort of know the type of people to go out and find. I think we’ve been more successful than others in finding some of the best and the brightest, for sure.
But that talent has turned into a feature store that we’re very proud of. We think about this based on the point of prediction. So for us, that means in the disease state or the hospital encounter, there are different times you want to intervene. If I’m looking at chronic care management across the spectrum, managing diabetics who are at home or in the population, that data is treated very differently than if I’m in an ICU bed fighting for my life.
So we think about those as different temporal networks of features over time. So for the hospital, we have a feature store that is every patient for every hour of their encounter. And we’ve got all the details there, their vital signs, the medications that they’re on, that combination over time. We’ve got many, many rows for each patient visit.
If you’re looking more broadly at a population level, we’ll have feature stores that are set up with monthly intelligence, for instance, so that we’re monitoring and we’re looking over the full patient experience and we’re looking back and go, perhaps, how many opioid medication days did this patient have in the last quarter, the last year, the last two years? We’re looking back at that history over time. And so it’s not only, I think, the novel selection of the features that really can predict the outcome. It’s also the temporal sequence of those that’s very important to us.
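A lookback feature like "opioid medication days in the last quarter or year" can be sketched in pandas; the table, column names, and as-of date here are hypothetical, not MUSC's actual schema:

```python
import pandas as pd

# Hypothetical daily medication log: one row per patient per opioid day.
log = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "date": pd.to_datetime(["2020-01-10", "2020-03-05",
                            "2019-09-01", "2020-02-20"]),
})
asof = pd.Timestamp("2020-04-01")  # the point of prediction

# Count qualifying days inside each trailing window, per patient.
feats = pd.DataFrame(index=sorted(log["patient_id"].unique()))
for name, days in [("opioid_days_90d", 90), ("opioid_days_365d", 365)]:
    recent = log[log["date"] >= asof - pd.Timedelta(days=days)]
    feats[name] = recent.groupby("patient_id").size()
feats = feats.fillna(0).astype(int)
print(feats)
```

The same pattern, recomputed at different points of prediction, gives the temporal sequence of features described above.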
So Matt, let’s take, for example, a classic task in any enterprise and in any organization like yours, which is forecasting. So you went from a world where you had forecasting models, and then COVID happened. And then obviously the forecasting models either no longer work or have to be drastically revamped. So two questions. The first is, have you folks had to revisit your forecasting models, and to what extent has machine learning helped you there? And secondly, in many ways, the forecasting model is an example of what you just said about the importance of the feature store. But on the other hand, one should probably be aware that the feature store itself is never done. If the world changes, it means you may have to revisit what kinds of features you’re using in your forecasting model.
Absolutely. So I think what we’re talking about here is sort of our non-patient-specific realm of machine learning. So much of what we do is very patient centric and focused on an individual and generating features. But when we think about these trends over time, there’s a quote that came from my lead data scientist when COVID-19 first hit.
He’s got a significant background working with NOAA and looking at weather patterns. He has a very interesting background coming into healthcare, which is great. But we were on a team call, looking at what was going on and at the impact here. And most of our volume indicators dropped by about 50% almost overnight. I mean, it was a totally unforeseen event in our healthcare system, because we’re always at capacity.
As a level one trauma center, we really run at max capacity all the time. So to see those volumes drop to the floor as we prepared for an onslaught, the curve, the overwhelming surge that unfortunately we’ve seen in so many US healthcare systems, really changed the nature of our business. And he called me up and he said, “Well, we’ve forecasted for hurricanes before. We are in South Carolina.” He said, “Matt, this looks like five hurricanes in a row.”
I mean, it really looked like an outlier event, in such a way that we couldn’t just tweak for this. And so we’ve rethought that. What I would say here is we’ve been fortunate in Charleston to bounce back. We’ve been fortunate to really avoid that onslaught of patients. Certainly we’ve had hundreds and thousands of patients who have come in for COVID-19, and we’ve cared for them very well, but we haven’t had that massive feeling of being overwhelmed, and we’ve been able to resume business in a really safe fashion.
We spent a lot of time preparing for COVID-19. We were one of the first drive-through testing centers in the country. We were well-supplied. We were well-planned. Our leadership really broke through during this pandemic crisis. So we’re very fortunate for that. But specific to the forecast: although we’re moving back into more normal business, we have to rethink how we plan the business.
So for us, that’s going to be weekly, monthly, daily, continuous reforecasting. In our system, we’ve built a rescoring function that’s always going to be running, every week, every month. And now we’re building trust with the finance team, which is recalibrating and looking at those forecasts. So we’re moving from a horizon of let’s look out for the whole year, to maybe let’s look out for a quarter, to now looking at the business continuously. And we’ve had to build systems that can support that.
So, Matt, how do you folks keep track of the growing number of models and experiments and tweaks and all the things that at the end of the day maybe the end users only see the polished product, but you folks probably went through a lot to get to that end state and also you learned a lot to get to that end state? So how do you keep track of the lessons learned along the way towards building and deploying one of these models?
Well, I would say, we’re big fans of MLflow, for sure. And we’re definitely using that capability in our AI workbench. You’re right, Ben. We track hundreds of runs of models for each project. I mean, so much of what we do is based on looking at almost any type of technique or approach. We’re not dogmatic at all in the types of libraries or approaches that we’re using.
We’re going to be looking across the board. And yes, we try to keep that complexity opaque to our users. They see a final, polished model with awesome precision and recall and all those good statistics we love to look at. But we do keep that lineage of history. And so for us, the ability to have very strong model lineage, to have those feature optimization steps, to have all of that logged has been critical, and MLflow certainly helps us do that.
There are many inventors and scientists who say that a lot of the best lessons and learnings come from some of these failed experiments, right?
That’s right. We may give credit to Edison with the light bulbs, but I’ve told people, “I know a hundred thousand ways not to build a COVID-19 symptom tracker, but I’ve got a couple that work as well.” So we’re very much of that mindset. And our data science team will jump into modeling pretty quickly, probably a lot more quickly than we used to, when we would really deeply define the problem first.
One of the nice things about having this workbench setup is it lets us get to model one very fast, and we can iterate from there. We’ll have some very talented physicians or operational leaders who will partner with us and go on that journey. And when you can spit out a model in two or three days and then get feedback, it entirely changes the conversation about how we use ML in the organization.
So what processes and tools do you have in place in order to review and test models? In classic software engineering, there’s a long history of testing and all of these things. In ML, now you have code and you have data. So how do you make sure, especially in your case, because you’re in a highly regulated industry, that the model is ready to go live?
That’s absolutely right. So in healthcare, we’ve got to be fast and we’ve got to be good, both at the same time, as many other industries do as well. And so to make sure that that is safe and that the predictions are doing what they’re supposed to do, we go through a number of steps. The first thing we do very early on in the process is we have a clinical champion who deeply knows that subject matter.
So if I’m working on, say, a detailed readmissions model for heart failure, a chronic condition that is one of the most difficult to treat over time, I’m going to be working with a world-class cardiologist who knows this disease state forwards and backwards. We’re lucky here at MUSC because we have academic leaders who are also deeply smart clinicians.
And so they’re going to have some familiarity with publishing and with understanding the statistics and the pieces that go into that. So I certainly admit my privilege in that regard. But those folks will step in and understand quantitatively how the model is working. We’ll be very thorough in our use of test and training sets and make sure that all that’s well covered.
And then we’ll go into a clinical validation process. Our clinical validation team is an informatics group here within MUSC, clinical IT experts who look at workflow and the benefits realization of these processes. And so they will do a sampling. They will review those charts for efficacy. So they’re going to read through the notes and make sure that what the model is producing lines up with clinical reality.
That process will also shadow the model in our electronic health record. So if we build a new model and we go through clinical validation and things are looking good, we have the ability to implement it into our Epic EHR. And then it will actually run in the background, typically for 45 to 60 days, and then we will take those real patient predictions and do a validation step on those as well before we actually go to production.
You’re in health care and as you said, it’s important for people to understand how the models work. But how much tolerance is there for some of these more black-boxy models in healthcare?
I totally hear that phrase a lot. And as you can imagine, I have physician scientists who want to know precisely how these systems work. They’re intensely skeptical. It’s one of the things I love most about them. They want to understand the evidence. They want to understand the true efficacy of this. They really treat the models as if they were a different type of clinical intervention.
We’re spending a lot of time in this country right now talking about vaccines and trials. We’re looking to run some COVID-19 vaccine trials here at MUSC, we announced yesterday. So when we start to look at the efficacy of things, we think about our predictive models that way. We put them up to intense scrutiny, to peer review, to make sure that we can replicate these things so that those are going into practice.
And one of the things that our team has done that I’m very proud of is model interpretability. So we don’t just build a black-box neural net that can’t explain what’s going into it. We spend a lot of time working with Shapley attributions and feature importance. For every model we’re showing feature importance, and we’re actually showing, at an individual prediction level, the top features that went into that prediction.
And there’s even some ways we’ve implemented that into our electronic health record so you can mouse over a prediction and you can see what was driving that. That has been critical to understanding what really drives the model from a design perspective, but more importantly, how it impacts that precise prediction for that patient. So we spend a lot of time making sure that that’s understood and not a black box.
So as you know, in machine learning there are a lot of steps that go into building and deploying models, including what you just described, the model explainability stage. So you can imagine a machine learning platform like yours, your workbench, having many, many components. So, question: what’s your criteria for build versus buy? For example, in what you just described, model explainability, there are some companies and startups who provide these tools. So in what areas, and by what exact criteria, do you decide to build versus buy?
Yeah. I think for us we talk about building ecosystems and platforms that improve MUSC. And what that means is developing partnerships with companies who can provide that next level of platform and infrastructure that allows us to move forward faster. So that’s typically when we are going to buy something. When we look at our cloud journey, our entire strategy around leveraging the cloud to build healthcare systems that can improve care, that’s a spot where we’re not going to build that capability here locally in MUSC.
So working with some of those platforms has been an accelerator to get into a technology stack, be it a data lake or Spark computing or those types of deliverables. When it comes to the models themselves, we’ve been much more prone to build. And I think that’s in part because we’ve been fortunate to have the talent and the focus. But we also get better results that way, because we can see that these solutions make sense to our clinicians.
We know them intimately. We understand how they’re made. We can see the lineage. We own that code. We’ve got a really thorough understanding there. And that builds trust in the system. So in some cases we’re taking that on very heavily. There are also cases where we take things off the shelf. There’s been a lot of success by the electronic health record companies, Epic, Cerner and many others that I don’t want to leave out; they’re starting to build some early predictive models into their tools. We work very closely with those partners.
So Epic’s a big partner of ours. We’re on their advisory group for their cognitive computing platform and we spend a lot of time working with their data scientists up in Wisconsin. But we believe that the best model is going to be built and tuned on the patients in South Carolina. So because we have this rich data set, we believe that building those locally is going to produce the best result.
So give us a sense of where you are in terms of your ML journey. One lens, Matt, that might be useful: there are a bunch of models, maybe the forecasting models, that rely mostly on structured data, and then there’s unstructured data, which could be physicians’ notes or medical images and maybe even audio. So what kinds of data sources are you now using for your machine learning models?
That’s great. I feel like we’ve really got our feet under us with the structured data, with the pieces that we’ve talked about. We’ve built double-digit models on our workbench. We have models at various stages of clinical validation, and we’re really proud of that work. So that’s given us an opportunity to step into some new, less structured pieces. One of the areas that we’re very focused on could be considered structured data, but it’s streaming data.
So really trying to speed up the predictions. We opened a new children’s hospital this year, the Shawn Jenkins Children’s Hospital, really one of the most world-class facilities for caring for kids and the destination facility here in the state of South Carolina. And as part of that, as part of our digital transformation here at MUSC, there’s a technology plan that goes with that new facility, which produces data at a rate far beyond what we’ve had in our traditional environment.
And that extends to everything from patient engagement systems to a number of real-time monitors for clinical conditions. So our goal has been to stream that data from that ecosystem into our models in near real time, score it continuously, and speed up that timely insight. So that’s the streaming area. And then the next frontier I would speak about would be our utilization of notes and images, particularly notes. We’re spending a lot of time here.
This is the foundation of what a physician, what a nurse puts into the electronic health record. And it's been sorely underutilized. There have been a number of people who've tried to exploit it, but we believe that using natural language processing and modern machine learning techniques, we can start to extract features and word embeddings here at MUSC that will speed up the time to prediction as well, and really make full use of all of that good medical intelligence that's going in there.
And then beyond that, I think you'll see us start to, with certain partners, expand out into looking at medical imaging. We've worked with a number of companies. We have a deep partnership with Siemens Healthineers to look at the use of images across the health system and to leverage our vendor-neutral archive and our PACS systems. We have had digital images since the '90s here at MUSC. So we have a rich history of images that we're looking to tap into.
By the way, speaking of text, my understanding based on conversations with people who work on NLP in healthcare is that it's actually quite challenging because there are so many sub-disciplines within healthcare. They all have their own lingo and shortcuts for expressing themselves. So I guess basically you can't take an off-the-shelf tool or models in the cloud and just point them at your data. It just doesn't work well.
I think that's right. And I think you've seen some interest in the industry in trying to take that approach and really train models on massive sets of text. I mean, I think one of the holdups is that there's great sensitivity and there are privacy considerations on these notes. Facilities don't want to just take all of their notes and turn those over to a tech company that can build those models at scale.
So we have to make sure that we treat that appropriately. And you're also right, there are a number of nuances. There's deep understanding. There's a history in medicine of learning how to write and author notes, and there's a number of good works there. We very often will train our models on a medical corpus or a medical vocabulary.
So if you're familiar with PubMed, for instance, where you have all the medical literature, a lot of our word embeddings are going to be created off PubMed publications, versus the millions and billions of characters from books you'd see in a more traditional use case, because medical language has a unique flavor in and of itself.
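As a toy illustration of why domain-specific embeddings matter: words that appear in similar contexts get similar vectors, so a corpus of medical text yields very different neighborhoods than a corpus of books. The sketch below builds crude co-occurrence vectors from a few invented clinical sentences (nothing here reflects MUSC's actual pipeline; real systems would train word2vec-style embeddings on millions of PubMed abstracts).

```python
import math
from collections import Counter, defaultdict

# Tiny, invented stand-in for a PubMed-derived corpus.
corpus = [
    "patient presented with sepsis and fever",
    "sepsis risk increases with fever and hypotension",
    "imaging revealed pneumonia in the patient",
    "pneumonia and fever were noted in the patient",
]

# Count which words appear within a +/-2 word window of each other.
window = 2
cooc = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[w][tokens[j]] += 1

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Terms sharing clinical context words end up with similar vectors.
print(cosine(cooc["sepsis"], cooc["fever"]))
print(cosine(cooc["sepsis"], cooc["imaging"]))
```

The same idea, scaled up with neural embeddings and a genuinely medical corpus, is why a PubMed-trained model handles clinical language better than one trained on general text.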
And actually one of the main NLP tasks that’s very much specific to healthcare in many ways is this thing called de-identification, right?
That's right. Because if you look at de-identification and the privacy issues in healthcare, it's very difficult to anonymize notes, particularly at that scale. I mean, there are a number of research projects that have demonstrated re-identification there. So it's definitely an issue of consideration. I'm very fortunate here at MUSC that we work very closely with our information security team.
Our cyber folks are engaged from day one in architecture. Our data architect, our machine learning architects, our cyber security architects, they’re all working together to plan this so that this system works with full security privacy in mind because it’s a real risk area for us right now.
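De-identification is typically attacked with a mix of rules and trained NER models. As a rough, hypothetical sketch only (the patterns and the sample note below are invented, and real HIPAA Safe Harbor de-identification covers 18 classes of identifiers, not three), a rule-based scrubber might look like this:

```python
import re

# Hypothetical de-identification sketch: mask a few common PHI patterns.
# Real systems combine many more rules with trained NER models to catch
# names, addresses, and the other HIPAA Safe Harbor identifier classes.
PHI_PATTERNS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),       # dates
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),            # phone numbers
    (re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE), "[MRN]"),    # record numbers
]

def deidentify(note: str) -> str:
    """Replace each matched PHI pattern with a placeholder token."""
    for pattern, token in PHI_PATTERNS:
        note = pattern.sub(token, note)
    return note

# Invented example note.
note = "Seen 03/14/2020, MRN: 483921, call 843-555-0100 with results."
print(deidentify(note))
```

The re-identification research Matt mentions is exactly why rules alone are not considered sufficient: free text leaks identity through context, not just through obvious patterns.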
So you’ve mentioned privacy a few times, that you folks have a lot of checks and systems in place for privacy. What about other considerations, Matt, like fairness, for example?
Yeah. I think so much of what we think about is the ethics behind machine learning and artificial intelligence. We have a group that has been spun up to look at this. We have medical ethicists and quality experts, diversity and inclusion scholars. We have a number of people who are coming in here to ask, are we introducing bias into our models?
Are we taking our implicit bias and baking it into our models? Are we taking years and years of clinical history, and does that creep in? So we're very interested in these topics. Looking at equity across all of the data products that we build, we're making sure that underserved populations, communities of color, are not disproportionately impacted here.
So I think all of that is critical. And also, I think we want to make sure that we have a good lineage of all these things. So if we’re making predictions that these are clinical decision support tools, that we have a lineage of, what was that data in context and time? And why do we make decisions based on that? So we’re really building intelligent decision systems that can be audited and reviewed and carefully considered as well.
So as a chief data officer, I'm sure you've given a lot of thought to data governance and tools for data governance. Are you starting to think about model governance as well?
Absolutely. You're right. One of the pleasures I have is to chair our data governance group and work very closely with our data stewards and information stewards across the organization. And models are increasingly becoming one of our most powerful data assets. So if we look at our data warehouse, our streams, and our models, we call those our data assets.
If we look at our models, I think we have to treat those in the same way. They have a lineage, they have to be stewarded. They have to be looked after and cared for. And increasingly now we're training them and retraining them over time. And so AI governance, model governance, is absolutely on our roadmap and connects very closely to our data governance strategy.
And to close, you hinted earlier that you folks are moving more and more into streaming. So how do you envision what you're doing in machine learning interacting with streaming down the road?
So for us, it’s all about the speed to that prediction, the speed to the insight. So we built a base that can produce these things at scale. With streaming, we’re going to get more data and we’re going to get it faster. And so for us, that’s building pipelines that can scale across the organization. As that data comes in, it’s made us look at new pieces of technology, IoT connectors, event hubs, streaming data frames, all of those types of capabilities we have to build in.
We've also looked at industry standards as well. So in healthcare, we always look at a set of standards called FHIR from HL7, because that provides a standard data model with which we can stream these device inputs into the cloud. So I think that's the next level of discovery for us, but we believe that our foundation is going to allow us to bring in those signals more quickly.
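To make the FHIR point concrete: HL7 FHIR represents clinical data as JSON resources, so a streaming pipeline can parse device readings with ordinary tooling before feeding them to a model. This hypothetical snippet pulls a vital sign out of a FHIR R4 Observation (the patient reference and values are invented; 8867-4 is the LOINC code for heart rate):

```python
import json

# Invented FHIR R4 Observation resource, shaped the way a monitoring
# device integration might emit it. Field names follow the HL7 FHIR
# Observation structure; the values are made up for illustration.
raw = """
{
  "resourceType": "Observation",
  "status": "final",
  "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4",
                       "display": "Heart rate"}]},
  "subject": {"reference": "Patient/example"},
  "effectiveDateTime": "2020-06-01T12:00:00Z",
  "valueQuantity": {"value": 112, "unit": "beats/minute"}
}
"""

obs = json.loads(raw)
if obs["resourceType"] == "Observation":
    display = obs["code"]["coding"][0]["display"]
    value = obs["valueQuantity"]["value"]
    unit = obs["valueQuantity"]["unit"]
    # In a real pipeline this record would flow into a streaming
    # feature store rather than being printed.
    print(f"{display}: {value} {unit}")
```

Because every vendor emits the same resource shape, the downstream pipeline only has to be written once, which is the practical payoff of the standard data model Matt describes.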
So often healthcare has been looking in the rear-view mirror. We've been looking back at patient surveys and financial reports and payer claims files that come in. Those things can lag for months. And so we're asking, what's the signal today in our health system? How can we get better each and every day so that we can make an impact for the patient that's right there in that exam room, in that clinic, in that hospital bed?
Actually, as you were talking there, I thought of one more thing that we haven't quite touched on yet, which is UX. Because especially with streaming, you don't want someone to be drowning in metrics and alerts. There's a fine line between giving people timely information and giving them too much information.
That’s right. I mean, so often some of the feedback I get as a chief data officer here is, “Matt, we have too much data.”
Show me only what you think I should know.
That's right. We so often want to find that signal in the noise. And for us, alert fatigue is a real concern. Our physicians have so much information coming at them at all times. Our job is to harness all that information and really turn it into deep intelligence and insights. And that's one reason we've built many of our models the way we have: so that we can really dial in the sensitivity and specificity to drive a good alerting strategy.
So we're not sending more alerts, we're sending smarter alerts. Very often from that UI/UX perspective, we're building classification systems instead of alerting systems. Our physicians don't need things going beep at them in the night. I've had that happen too often. So often what we're doing is showing them context. I like to think of it as augmented intelligence, so that you're sharing that context with them.
So: I've got 20 patients on this unit, who's the most at risk of developing sepsis within the next 24 hours? Who's the least at risk? Those types of systems have worked very well. And you're right, we have to build those with good design principles. We have to use advanced data visualization techniques to really focus that in so that it's not overwhelming.
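That ranking-plus-threshold idea can be sketched in a few lines. Everything here is invented for illustration (the patient names, the risk scores, and the 0.75 threshold); the point is the shape of the logic: rank by model score and alert only above a tuned cutoff, rather than beeping on every event.

```python
# Hypothetical unit view: rank patients by model risk score and flag only
# those above a threshold. In practice the threshold would be tuned
# against the model's sensitivity/specificity trade-off, not hard-coded.
patients = [
    {"name": "Patient A", "sepsis_risk_24h": 0.12},
    {"name": "Patient B", "sepsis_risk_24h": 0.81},
    {"name": "Patient C", "sepsis_risk_24h": 0.47},
]
ALERT_THRESHOLD = 0.75  # invented cutoff for this sketch

# Highest-risk patients first, so the clinician sees them at the top.
ranked = sorted(patients, key=lambda p: p["sepsis_risk_24h"], reverse=True)
for p in ranked:
    flag = "ALERT" if p["sepsis_risk_24h"] >= ALERT_THRESHOLD else ""
    print(f"{p['name']}: {p['sepsis_risk_24h']:.0%} {flag}")
```

The UI consequence is exactly what Matt describes: one ordered list with a small number of deliberate alerts, instead of a separate alarm per metric per patient.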
And with that, thank you, Matt. It's been great to have this short conversation with you, because I can tell how passionate and engaged you are with the mission of your group and of the Medical University of South Carolina.
Ben, thank you so much for having me. I really enjoyed it.
Thank you everyone for attending this webinar. We’re now going to have a live Q&A session with Marijse Christos. So please, if you have any questions, feel free to add them in the chat box and we will reply to your questions.
What is the mechanism to control the personal information of patients? Which technology are you using?
So just in terms of personal information on the platform: it's important to understand that the Databricks platform doesn't actually store any data. That's always stored in the underlying cloud where you keep your data. So in terms of storing personal information, anonymizing that data is up to the customer. However, the Databricks platform can be compliant with standards like HIPAA, which is quite common in the USA, and also with GDPR, where the use of Delta can also help. I hope that answers the question sufficiently.
Is Databricks being used by any public authorities, tax administrations across the globe for e-invoicing, e-filing purposes to analyze data?
So I do know that Databricks is being used by government authorities in the US, but also in the UK and Europe, I believe. So that's governments, public authorities. Whether specifically by tax administrations, I'm not sure. However, I do believe that Databricks is being used for lots of analysis of data, including text data like invoices. So I wouldn't be surprised if we have particular use cases around that.
Do you have use cases in the supply chain and logistics industry, especially aviation supply chain?
Yes, we do. So there are a couple of airlines we work with regarding supply chain and logistics. For exact examples, I would actually refer you to the same page on our website. That will give you more information about both. I think the aviation example is on there, but if not, there are definitely examples around supply chain and logistics on our website as well.