Building a Data Science as a Service Platform in Azure with Databricks

May 28, 2021 11:05 AM (PT)

Machine learning in the enterprise is rarely delivered by a single team. In order to enable Machine Learning across an organisation you need to target a variety of different skills, processes, technologies, and maturities. To do this is incredibly hard and requires a composite of different techniques to deliver a single platform which empowers all users to build and deploy machine learning models.

 

In this session we discuss how Azure and Databricks enable a Data Science as a Service (DSaaS) platform. We look at how a DSaaS platform empowers users of all abilities to build and deploy models, and enables organisations to realise a return on investment earlier.

In this session watch:
Terry McCann, Director, Advancing Analytics

 

Transcript

Terry McCann: Hello, and welcome to my session on creating a data science as a service platform in Azure and in particular with Azure Databricks.
My name is Terry McCann. I am the Director of Artificial Intelligence for a company based in the UK called Advancing Analytics. I’m also Microsoft MVP for AI, which basically means I just spend way too much time at events like this.
So what are we going to look at today in this session? What we’re going to be doing is going through a couple of different things. We’re going to be kicking off by discussing a little bit about some of the data science personas. We’ll move on to where the challenge is in trying to orchestrate and architect an environment for enabling all of these different personas. We’re then going to move into talking about how we can enable all of these users, and we’ll go through a couple of different examples looking at Azure Machine Learning services, MLflow, and Azure Databricks.
So personas in data science. Now there is a huge range of different personas in data science. In the last couple of years, we’ve seen an explosion of different terms and different roles around machine learning and data science in general. So machine learning engineers, data scientists, modelers, these kind of roles, and focusing just on that group of personas is kind of where most people are at the moment, and where most people suffer. And when I say suffer, that’s because we’re not enabling some of those data science personas which might be a little bit hidden inside your organization, and those are the ones that we want to focus on today.
So let’s go through a couple of them. So we’ve got our data scientists, and they are typically very experienced in machine learning. They’re deeply experienced in shallow machine learning as well as deep learning. They’ve got all the Python skills that they need in order to build models appropriate for your business and your domain. They’re very comfortable in writing notebooks. However, they might not be as comfortable in actually deploying an API and figuring out what they need to do to manage that API effectively.
Then moving further down our chain, we’ve got our data analysts. Now our data analysts aren’t typically a user we might associate with having to try to enable with machine learning because they don’t fall into that core group of our machine learning developers and our machine learning engineers. But what they do have is they have a huge amount of experience in data as well as SQL. They’ve got a lot of experience in other areas as well, such as data visualization, but they’re probably a bit naive to the process of machine learning. And they may have a good interest, but they don’t really know how to apply it, but they know how they would apply it to their domain if they have the skills. They’re also probably a little bit short on Python, and R, and the typical machine learning stack. So it’s a lot harder to enable this type of user. But if we do, they’ve got lots of rich domain knowledge that we might be able to take advantage of.
Going slightly higher than that role, we’ve got our senior analysts. So they’re the more senior version of our data analysts. They’ve got more experience in SQL, data, data visualization. And they might have a basic understanding of machine learning, and at least of the seven steps that we typically go through in machine learning. They’ve got a deep interest in machine learning, but again, they might not know how to apply that by taking advantage of notebooks and of handwriting Python. However, if they had a user interface, they might be able to actually do that in an easier way and apply those skills without necessarily having that Python knowledge. And by enabling them, again, we might be able to take advantage of some really good core skills.
And then the last persona, which is often overlooked but it’s also super important to make sure that we’re not missing, is that of our advanced business users. As a consultancy, we work with companies across the board, but we work with a lot of financial services institutions. And whenever you work with financial services, whether that’s insurance or banking, what you always end up with is you work with actuaries who model risk.
And when you work with people who model risk, they are typically very deep in statistics and they have a deep understanding of modeling, but it might not be the same as machine learning. It might be more statistical-based or they might do some machine learning. But being able to enable them, the jump to machine learning is a lot easier for them because they’ve got this way of working already. They just might not know the process, and they typically are quite experienced in Python or R. Typically we find R, actually. They’re a lot more experienced in R than they are necessarily with Python.
So we need to be enabling multi languages as well as multi-skills. There’s a lot of different competing priorities.
So, if we try and plot these different personas, we can do that on these two axes. We’ve got machine learning maturity and complexity. Typically as complexity increases, so does maturity, and you’re able to take on new things. By the same idea there, as you get more mature, typically the approach that you take is more complex, but it doesn’t necessarily need to be. You end up with a kind of curve that looks like this, this curve trending upwards and going quite steep quite quickly.
And if we start to plot our various different users, we start here with our data analysts who don’t necessarily have that maturity and their complexity is quite low. They’re looking to potentially solve some naive problems in your business, but they will probably solve a lot of business problems. There might be quite low-hanging fruit, but that’s good problems that get solved.
Moving up to our senior analysts, their maturity increases slowly, and their complexity might increase again a little bit to match it. But again, they’re trying to solve these quite naive and quite low hanging fruit problems.
As we start moving up into those business users, that’s where we quickly start seeing complexity increasing, ultimately until we’re at that top of our tier where we’re talking about our data scientists and our machine learning engineers. And so we need an environment that can cater for all of these roles, increasing along this maturity curve.
So where is our challenge? I mean, that is our challenge, is that this kind of architecture is hard. There is a huge number of competing priorities that need to be considered in order to capture this and get it working effectively. So if we think about what we need to do to enable these roles, our data analysts, deep in SQL, but they might not have the machine learning know-how or the Python experience to be able to build a model, but they’ve got the ideas. One way that we can help enable them is through automated machine learning or AutoML. By generating out hundreds of different models, trying different approaches, we should be able to come up with an okay model to try and do what they’re doing. And if they can get that incorporated back into the line of business processes, then we’ve got a really good enabled business user who’s taken advantage of machine learning.
Moving up, we might decide there that actually our senior analysts, they know more about the process, so they don’t need AutoML. AutoML still might help them, but actually having a GUI based way of working is actually going to enable them so much better. So let’s give them a GUI.
Moving up to those advanced business users, we don’t know where they’re building their models. They’re probably training them offline. They might be using some desktop software that we don’t have available in the cloud, but what they can do is they can give us a model and we can get it deployed. There, we want to be enabling any model in any kind of deployment and getting that served to abstract the whole process away from them.
And then our last part there is really around notebook development and real-time. So our data scientists want that notebook experience, and they need a real-time API that they can service, that they can hit and get their responses back.
And so this is the challenge. This is difficult. How are we going to enable this? We’ve got a couple of key technologies. We’ve got Databricks. Databricks is where we’re going to spend a lot of time trying to enable these users. But trying to do this across one platform is impossible. We need a platform that uses a kind of polyglot approach, taking the best tech that works to solve each problem. That’s where we start looking at MLflow and also taking advantage of Azure Machine Learning. If you’re in Azure and you’ve got Azure Databricks, then it just works.
And so what we’re going to look to do is, we’re going to look to take the best parts of each of these and map them. So here we’ve got AutoML and our GUI interface being backed up by Azure Machine Learning, whereas any kind of model serving as well as our real-time is being delivered via Databricks.
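As a compact restatement of that mapping, here is a minimal sketch in Python. The persona names and platform assignments come from the talk; the dictionary layout and the `route` helper are purely illustrative and not part of any Azure or Databricks API.

```python
# Persona -> (capability, platform), as described in the talk.
# The structure itself is a hypothetical illustration.
PLATFORM_MAP = {
    "data analyst": ("AutoML", "Azure Machine Learning"),
    "senior analyst": ("GUI designer", "Azure Machine Learning"),
    "advanced business user": ("model serving", "Azure Databricks + MLflow"),
    "data scientist": ("notebooks and real-time API", "Azure Databricks + MLflow"),
}


def route(persona: str) -> str:
    """Describe which capability and platform serve a given persona."""
    capability, platform = PLATFORM_MAP[persona.lower()]
    return f"{persona}: {capability} via {platform}"


print(route("Data Analyst"))
```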
So now let’s actually take a look at how this is going to work and run through a couple of different examples and demos. So here we are over in my desktop, and where I am now is I’m in Azure. So I’m in the Azure portal and what I’ve got is I’ve got a resource group and inside that resource group, I’ve got an Azure Machine Learning services instance already created. And what I can do is just hit Launch studio, and that’s going to drop me right into their landing page where everything is built from there onwards.
And so going back to the personas that we’re trying to enable, first off, we’ve got our report developers, we’ve got our data analyst. With them we said, we want to be looking at AutoML. So one of the first parts we’ve got over here, over in our services here, we’ve got this guy, we’ve got AutoML. So if we click into here, what AutoML is going to do for us, is it’s going to take our data, whatever that is, and it’s going to try a whole load of different approaches until we can try to get to an idea or a good model that works for the problem that we’re trying to solve.
So I can come up here and I can say, “I want to run a new AutoML run.” And now I’ve got a couple of different datasets. And for each example that we’re going to go through, I’ve got the same data so it’s completely consistent. And we’re looking at the bike rental data, which I think you can get from UCI, but it’s also a Databricks data set so you can just go and get that if you’re inside any Databricks environment. And I’ll show you how to do that in a moment.
Now, this data is incredibly leaky. So I’ve got two versions here. I’ve got one which contains the leaky data, which basically means if you add these two columns up, they equal the label that we’re trying to predict, and we definitely don’t want to do that. And then I’ve got the version which doesn’t include that. So if we take an example here, I’m going to take this bike hourly with no leak, and I hit Next. And what it’s going to do here, it’s going to say, “I want to start configuring my run.” And now everything in Azure ML is driven by an experiment and that’s how you can track what’s happening. So I’m going to say, “Right, I want you to use my AutoML experiment, and the column I’m trying to predict is this one down here, it’s the count.” Now I’ve got a compute cluster already created, and that is a scale set of virtual machines that you can scale up to, as and when you need to, and scale right back down, so you’re not having to pay for too much at any one time.
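To make the leak concrete, here is a small sketch in plain Python. It assumes the column names used in the public bike sharing data (`casual`, `registered`, and the label `cnt`); the three rows are illustrative values, not a reproduction of the speaker's dataset.

```python
# Illustrative rows; column names are assumed to follow the public
# bike sharing dataset, where cnt is the label to predict.
rows = [
    {"hr": 0, "temp": 0.24, "casual": 3, "registered": 13, "cnt": 16},
    {"hr": 1, "temp": 0.22, "casual": 8, "registered": 32, "cnt": 40},
    {"hr": 2, "temp": 0.22, "casual": 5, "registered": 27, "cnt": 32},
]

LABEL = "cnt"
LEAKY = ("casual", "registered")


def is_leaky(data):
    """True if the leaky columns sum exactly to the label in every row."""
    return all(sum(r[c] for c in LEAKY) == r[LABEL] for r in data)


def drop_leaky(data):
    """Return copies of the rows without the leaky columns."""
    return [{k: v for k, v in r.items() if k not in LEAKY} for r in data]


print(is_leaky(rows))
print(drop_leaky(rows)[0])
```

A model trained with the leaky columns present can simply learn to add them together, which is why the no-leak version is the one used for training.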
If I hit Next, it will then drop me in to saying, “Well, what are you actually trying to do? Are you trying to do a classification problem? Are you trying to do a regression problem, or you’re trying to do a time series?” So am I trying to predict a label for our classification, which is categorical? Is it a binary decision? Am I trying to predict something numerical? Or am I trying to predict something across a time series? So it’s defaulted over here to thinking that my data is time series, but actually what I want to do is change that to say it’s regression. And then what I would do is hit Finish, and that’s going to start spinning everything up so that we can see all of this running.
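The demo configures the run through the studio wizard. For readers who prefer code, the same choices could be sketched with the Azure ML Python SDK (v1), assuming that SDK is installed and a workspace `config.json` is available. The dataset, cluster, and experiment names are hypothetical parameters, and the timeout value is an assumption rather than a setting shown in the talk.

```python
def automl_settings(label="cnt"):
    """The wizard choices from the demo, expressed as keyword arguments."""
    return {
        "task": "regression",  # changed from the time series default
        "label_column_name": label,
        "primary_metric": "normalized_root_mean_squared_error",
        "experiment_timeout_hours": 1.5,  # assumption, not from the talk
    }


def submit_automl(dataset_name, cluster_name, experiment_name):
    """Submit an AutoML run; requires the azureml SDK v1 and a workspace."""
    # Imported lazily so the settings helper works without the SDK installed.
    from azureml.core import Dataset, Experiment, Workspace
    from azureml.train.automl import AutoMLConfig

    ws = Workspace.from_config()
    config = AutoMLConfig(
        training_data=Dataset.get_by_name(ws, dataset_name),
        compute_target=ws.compute_targets[cluster_name],
        **automl_settings(),
    )
    return Experiment(ws, experiment_name).submit(config, show_output=True)
```

Keeping the settings in a plain dictionary makes it easy to review the task type and label column before any compute is started.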
Now, I’m not going to do that, but I’m going to take us in and have a look at one that I’ve finished running slightly before. And so here what I’ve got is a couple of different runs under my AutoML. And I can see here, I’ve got this Run ID. It’s completed and it took an hour and 18 minutes to run. Now, if I click into that and choose models at the top, what this has done is it’s gone away and generated a huge variety of different models.
So here you can see I’ve got this big list here. And each one of these, as I page through all of these options, is a different model that was attempted. If you go right back to the very beginning… Let’s go right back here. Right back… and choose the one at the top, what we can do is we can click into here and we can choose our metrics and we can start having a look at how well this model is doing.
Now, I know for this particular model that a root mean squared error around about 30 is really good. That’s a really good model. It’s a really good output. So with AutoML, we’ve got to 41. That’s pretty good, but we probably could do better. But with this, I’m able to train a whole lot of models and get something that works pretty well without necessarily having a huge understanding of machine learning.
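For reference, root mean squared error is the square root of the average squared difference between actual and predicted values, so it is expressed in the same units as the label (here, bike rentals per hour). A minimal sketch with made-up numbers:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: errors of -7 and 1 give squares 49 and 1, mean 25.
print(rmse([16, 40], [23, 39]))  # 5.0
```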
Now, as my complexity increases after I’ve started working with this, and I might want to try something else thinking about enabling that senior analyst, I can hop over here and move into my designer experience. And so if I click into my designer, again, I’ve got a pre-trained model here already. And so this is that GUI-based way of working. With each one of these, I can drag them around and click into each one of them and configure them independently.
And so again, I’ve got my same data set here, albeit this is the one with those leaky columns. So what I’m going to do is I’m going to remove those columns and then I’m going to bring in a boosted decision tree regressor. And with that, I’m going to use that to train my model, and I can go in and do the normal things I’d want to do, split my data, take 70% of it down. Or in this example here, actually 80% down to train my model, and pull the rest down to score my model. And then I can actually have a look at the evaluation coming out of the back of this. If I click down in here, it’s always an awkward one to click around, we’ll eventually get the menu to come up, which is Visualize.
From here, I can see that my root mean squared error for this particular model is 39. It’s almost 40, which is significantly less than what we were seeing before. So with each approach, we’re able to slightly get a better way of working. And now if I wanted to, I could come in here and I could choose to extend this. And I might say, “Actually, what I want to do is I want to pull on a linear regressor as well.” And so here I could pull that in and then I need to do a couple of additional things here. I need to score my model and I need to train my model. And what I can do here, come down to our train as well, and I can just start hooking these up, just start completely repeating the same process that I’ve got on the left side, now on the right. And with that, get my model hooked all the way up, bring this down, click into my Train and say, “Well, what label am I trying to predict?”
And as we’ve mentioned before, this is the count, and hit Save. And now, if I submit this, it’s going to go away and it’s going to, again, do the exact same thing, but now it’s going to compare two models and I can see which way of working and what’s working best, and then have the ability to start changing and tweaking this as well. So this is really giving you the ability to, again, enable that slightly more advanced user who’s familiar with the process, but might not have all of the skills to do this in a notebook or another environment.
And one of the better things that you have in here as well is you can hit Publish. And once you’re published, that can go into a Docker container, and be deployed onto Kubernetes, and scaled out, and load balanced, and work incredibly well for you. Or, it can hook back into Azure Data Factory and be scheduled on an automated run. So that helps you with those kinds of audiences.
But when we want to get a little bit deeper, we need to start looking at notebooks. So shifting gears, we’re going to have a look in Azure Databricks. And so I’ve got a notebook here, which is again, going to do that exact same process that we looked at in a GUI, but now we’re going to look at doing it in pure code. So what we’ve got is, we’re going to be mainly looking at this, the MLflow model registry.
So I’ve got a couple of bits and pieces happening here. Starting off, I’m importing a variety of different packages. So I’ve got pyspark.ml. I’m going to build this all out in a PySpark model. We could use XGBoost. We could use any other library that we wanted to. And what I’m going to do is I’m going to read in some data. And this is that link to that data set that is just available inside Databricks. So you can run this all through. And then I’ve got this bit here, which is a function, which is just going to create a machine learning model for me. And in the same way, it’s going to come in and it’s going to do a test/train split. It’s going to split my data 70:30. It’s going to pop that through a model, which is a gradient boosted tree, a gradient boosted regressor. And it’s going to start capturing a load of vital information about our model.
Once that’s run, that’s going to start enabling this experimentation tracking up here. And with that enabled, what we’re able to start seeing is logging out the metrics that are being generated by each machine learning run. For the latest version, I’ve got an RMSE of 124. That’s shot right back up, so that’s not a very good model.
The version before that was 54.66, what did I do differently? Well, I changed a couple of parameters and those parameters are here. I can start seeing the number of iterations and trees. So it’s 15 and 15. However, it was five and five here, so that’s not quite working. I probably need to increase those again to get a better model.
And then once our model is logged, so what I’ve got down here is I’m using Spark, I’m using MLflow’s Spark log model. And that’s going to log my model into the registry. From there, I can simplify that use case for my business user to say, you give me a model, I’ll get it logged in MLflow, and then once it’s there, our scoring script is entirely independent of what your model is doing.
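The notebook itself is not reproduced in the transcript, so the following is only a sketch of what such a training function might look like with PySpark and MLflow. The column names assume the public bike sharing schema, the choice of `maxIter` and `maxDepth` as the two tuned parameters is an assumption, and `data_path` and `model_name` are hypothetical arguments (on Databricks the data is commonly found at `/databricks-datasets/bikeSharing/data-001/hour.csv`, but treat that path as an assumption too).

```python
LABEL = "cnt"
# Identifier, raw date, and leaky columns are kept out of the features.
EXCLUDED = {"instant", "dteday", "casual", "registered", LABEL}


def feature_columns(columns):
    """Select the columns that are safe to use as model features."""
    return [c for c in columns if c not in EXCLUDED]


def train_and_log(spark, data_path, max_iter, max_depth, model_name):
    """Train a gradient boosted tree regressor and log it to MLflow."""
    # Imported lazily: these need a Spark and MLflow environment.
    import mlflow
    import mlflow.spark
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import GBTRegressor

    df = spark.read.csv(data_path, header=True, inferSchema=True)
    train, test = df.randomSplit([0.7, 0.3], seed=42)  # 70:30 split

    assembler = VectorAssembler(
        inputCols=feature_columns(df.columns), outputCol="features"
    )
    gbt = GBTRegressor(
        labelCol=LABEL, featuresCol="features",
        maxIter=max_iter, maxDepth=max_depth,
    )
    evaluator = RegressionEvaluator(
        labelCol=LABEL, predictionCol="prediction", metricName="rmse"
    )

    with mlflow.start_run():
        model = Pipeline(stages=[assembler, gbt]).fit(train)
        score = evaluator.evaluate(model.transform(test))
        mlflow.log_param("maxIter", max_iter)
        mlflow.log_param("maxDepth", max_depth)
        mlflow.log_metric("rmse", score)
        # Logging with a registered name also adds a registry version.
        mlflow.spark.log_model(model, "model", registered_model_name=model_name)
    return score
```

Each call produces one tracked run, so comparing parameter settings against RMSE, as the speaker does, becomes a matter of reading the experiment table.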
That means if you’re constantly changing that process and you’re updating that model, you can keep doing that independent to my scoring logic, and we have a complete separation and there’s no awful dependency problems. I’ve got a nice decoupled process. So with this now, my model would just automatically get published into Azure ML… Sorry… into MLflow.
All of these names are great, so very similar. You kind of just get stuck on all of them.
And so what I can do is I can start saying, “Well, actually what I’d like to do is transition this model into staging.” So with MLflow, you’ve got just un-versioned models. You can have any number of those, or you can have a model which is in staging. Or a model which is in production. And generally, you have unstaged stuff, it goes into staging, and then into production. And then once it’s flagged as production, any run after that can say, “Well, actually just go and get me a production model.” And then use that for scoring.
And so with something as simple as a couple of lines of code here, I’m able to say, “Go and fetch my model, and this is my model URI, go and fetch that, and instantiate this production model.” Takes a couple of minutes, [inaudible] about half a minute to actually run. But then once it’s there, this now is just as if it was trained. I can just say, “I would like to transform this. This is Spark ML, and transform everything.” And then I get my predictions. And sure enough, I can come over here and I can see, and I’ve got a nice column of predictions. And so that’s all just being mapped for me. So with that, we’ve got that complete separation and we’re able to enable that type of user.
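A sketch of that promote-then-score flow, assuming an MLflow version that supports registry stages (the stage workflow shown in the talk has since been deprecated in newer MLflow releases in favour of aliases). The model name `bike_model` is a hypothetical placeholder, not the name used in the demo.

```python
def stage_uri(model_name, stage="Production"):
    """Build a registry URI that resolves to the latest model in a stage."""
    return f"models:/{model_name}/{stage}"


def promote(model_name, version, stage):
    """Move a registered model version into Staging or Production."""
    # Imported lazily: needs MLflow and a reachable tracking server.
    from mlflow.tracking import MlflowClient

    MlflowClient().transition_model_version_stage(
        name=model_name, version=version, stage=stage
    )


def score_with_production(model_name, df):
    """Load whatever is currently in Production and score a Spark DataFrame."""
    import mlflow.spark

    model = mlflow.spark.load_model(stage_uri(model_name))
    return model.transform(df)  # adds a prediction column


print(stage_uri("bike_model"))
```

Because the scoring code only refers to the stage, a new model version can be promoted without touching the scoring logic, which is the decoupling the speaker describes.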
Now, the last one about enabling a REST API is just an extension of this. Without having to do any additional work, what I can do is I can come over here to models, I can choose the model that I was working on, which is this one here, bike model MVP2. And what I’m able to do is I’ve got this option up here, which only recently was enabled, which is serving. And if you click into here, it will ask you to create a new serving environment. I’ve already got one created, so it spins up a different cluster for you. And then it gives you this kind of environment, which if you’re used to API development and you’ve used things like Swagger, it kind of feels a bit similar. It gives you the environment that you can [inaudible] it. See how it’s doing, check latency, check all these factors that you’re interested in. And also, just give somebody the ability to kind of play around with your model a little bit.
So what I’ve got in here is I could paste in an API request. So here I’ve got a sample request, and what I’m going to do is I’m just going to put that in square brackets because it will expect an array of requests. And I can say, “Send request.” And then over here, what I’ve got is my response immediately back. And again, this isn’t a response that’s going back to your Databricks cluster. This is going to a separate REST API, and then servicing the result back into Databricks. And you can extend this, you could add multiple in here. You could copy and paste and get multiple scores coming back through here. So we can pop another one in here, put a little square bracket on, send request. There we start seeing I’ve got two responses back. And however your model is configured, that’s great. It’s going to work for you. And you’ve got a REST API that is up and running.
One last thing that you can do is step entirely away from Databricks, and you can just spin this up as any kind of API call, as long as you’re using something which is going to send a request. So I’ve got a couple of different functions here, which are just setting stuff up so that I can interact with that. And it’s basically just doing JSON POST requests against that API. And I can say, “Here, what I’d like to do is I’d like to submit a number of different requests as JSON, and I’d like to call this score model and pass that in.” And then immediately, again, what I’m able to see are all of the different results that I have available to me.
So with that, we’re able to go from AutoML, to a GUI way of working, to deploying a model, and then ultimately deploying a model inside a REST API, all within a couple of minutes. This stuff is quite easy when you’re looking to take advantage of each one of these different elements.
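The helper functions are not shown in the transcript, so here is a sketch of what a client outside Databricks might look like using only the Python standard library. The URL pattern (`/model/<name>/<stage>/invocations`), the bearer token header, and the record fields are assumptions based on how Databricks model serving was commonly called at the time; check the serving page of your own workspace for the exact endpoint and payload format.

```python
import json
import urllib.request


def build_scoring_request(base_url, model_name, stage, token, records):
    """Build a JSON POST request carrying an array of records to score."""
    url = f"{base_url.rstrip('/')}/model/{model_name}/{stage}/invocations"
    body = json.dumps(records).encode("utf-8")  # array, even for one record
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def score_model(request):
    """Send the request and decode the JSON predictions (needs network)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Hypothetical host, model name, token, and feature values.
records = [{"hr": 8, "temp": 0.3}, {"hr": 17, "temp": 0.6}]
req = build_scoring_request(
    "https://example.invalid", "bike_model", "Production", "TOKEN", records
)
print(req.full_url)
```

Sending two records in the array corresponds to the two responses the speaker gets back in the serving UI.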
So that was a tour of a whole load of different ways in which you can enable the variety of different personas. And so what I would say is, whenever you’re looking to build out an architecture such as this, think about meeting everybody where they are. We have an architecture that really supports multiple different data science personas. By enabling people across the organization to start getting advanced with machine learning, you’re able to start doing a lot of great things. Enable everyone to achieve more with advanced analytics and artificial intelligence through some of these simple steps.
So, thanks very much for taking the time to listen to my talk. Here’s a quick snapshot of some of my contact details. If you want to get in touch with me, please do. If you want to get in touch with Advancing Analytics, again, you’ve got all of the details there. And yeah, it’d be great to continue the conversation. Hopefully, I’ve been able to answer some of your questions in the chat. If not, do shoot me an email or drop me a message. One last thing to always talk about is feedback. It’s incredibly important for everybody and helps everybody learn, positive or negative. Everybody learns from feedback, so please make sure you fill in your feedback. And yeah, have a great conference.
Thanks very much.

Terry McCann

Terry is a Microsoft Artificial Intelligence MVP, awarded in recognition of his contributions to the Microsoft Data Science & Artificial Intelligence communities. His focus is on all things AI and Dat...