Learn to Use Databricks for the Full ML Lifecycle

May 27, 2021 03:15 PM (PT)


Machine learning development brings many new complexities beyond the traditional software development lifecycle. Unlike traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. In this talk, learn how to operationalize ML across the full lifecycle with Databricks Machine Learning.

In this session watch:
Rafi Kurlansik, Senior Solutions Architect, Databricks

 

Transcript

Rafi Kurlansik: Hello everyone. My name is Rafi Kurlansik, I am a Senior Solutions Architect here at Databricks and today I’m going to be going through a product deep-dive on how to use Databricks for the Full ML Lifecycle. So we’re going to talk a little bit with some slides but the vast majority of this is going to be actually in the product, showing you how to go through a particular workflow and manage machine learning models across their lifecycle.
All right, so for those who may not be familiar, the Databricks platform is a Lakehouse platform. What this means is that it allows you to work on all the different use cases that your data team might want to take on, whether it's data engineering all the way through to BI and SQL Analytics, including data science and things like that.
Now, today, what we're going to be talking about is machine learning on the Lakehouse. As we're going through this demo here today and walking through the product, I want people to keep in mind the different themes that we're really going to be touching on. Now, if we think about what is essential for sound machine learning operations, we need to think about the following things. We need robust data processing and management, making sure that our pipelines can scale and set us up for success in the future as our data volumes grow. We need to have secure collaboration, making sure the right people have access to the right data and the right code and models at the right time.
Testing is very, very important, not just of code but also of the data sets that we use for training and of the models themselves, to make sure that they have certain documentation associated with them. And of course monitoring is very important to maintain performance over time, and reproducibility is important, if nothing else, for the sake of debugging when you have to go back and look at things. And then documentation is essential in order to facilitate collaboration, but also to maintain compliance with corporate standards and legal standards.
So we’re going to be touching upon all those things as we go through the demo and to get started with the demo, let’s talk about what the business problem is that we’re going to be solving, right? So we’re not going to be doing any machine learning or any data science unless we have an actual problem that we’re going to be solving. So in this case, imagine that we are on a marketing analytics team and we have a lot of historical data about the customers that have churned. So we can put together a dashboard and we actually see the amount of money that we’ve lost due to customer return and things like that.
Now the data team has been approached by the business stakeholders and they've been asked to go further and try to predict which customers will churn, so that some sort of action can be taken and maybe they can be retained, which obviously has a positive impact for the business. So that sounds simple enough, right? Let's try to predict who's going to churn. What steps do we need to take to actually make that a reality?
So if we look at the workflow at a high level, what are the steps that we need to take? Well, a lot of this should seem very familiar. We need to do data prep and build features, we're going to train some kind of baseline model, and then we're going to want to set up some sort of automated system that allows us to test that. Then we're going to take the model object and put it into a centralized repository so that we can manage it better, and run our tests on that model that's in the centralized registry. Assuming that the model passes tests, okay, great, now we can actually make our predictions and start updating the dashboard. And then the last thing we need to do, of course, is to schedule this to update on a monthly basis, right? Some sort of schedule for retraining the model.
And these are the three personas that are on the data team, these are all the people who are going to be working on this. So what are the different tasks that they’re going to take on, right? Like who is going to do these different steps in the workflow? So let’s walk through that. So the first step is going to be taken over by the data engineer, right? Primarily making sure that you have pipelines that get data that is ready for training and exploration. That’s usually associated with the role of a data engineer. Similarly, at the other end of the workflow, when you actually have a pipeline that needs to be set up to actually score the model that is often performed by the data engineer as well. Sometimes that’s done by the ML engineer but in this example, we’re using the data engineer.
Okay, great. So we have some data flowing in. Now, what is the next step? The next step is going to be the workflow for the data scientists, what are their responsibilities? They are going to be in charge of taking the results of that ETL, the data that comes into the system, building features, doing exploratory data analysis and then starting to persist them into a feature store. The feature store is valuable because it allows other data scientists and ML engineers to discover what's already been computed for machine learning. At that point, they'll be moving on to actually train the baseline model, and then they'll be doing the work to pick their baseline model and promote it to the registry. At the very end of the workflow they'll probably come back once the predictions have been made and update the dashboard.
Now where the machine learning engineer comes in is more on the operational side, making sure that this is going to be stable, making sure that the quality of the model is strong, and being responsible for a lot of automation around those types of tasks. So the ML engineer is going to set up webhooks that will trigger Slack notifications and run the testing jobs. They'll also be responsible for actually writing the testing jobs and tagging the models as they come out of the testing job. And then at the very end, they'll be responsible for setting up the monthly retraining job that incorporates the tests and things like that.
So if you put this all together, these are the various tasks and steps in the workflow that the different team members have to take on in order to accomplish the business goal that the business stakeholders have set out for them. So this is quite a lot, and it covers quite a lot of different things. How are we going to do this in a single place? How are we going to do this just on Databricks itself? So as I go through the demo, you're going to see this same visualization or this same chart show up in the notebooks, and we'll be going through and showing where we are in each step of the process. And I'll walk you through the different tools that we use on Databricks to enable this.
Okay, so let’s head over to the demo. All right, so starting from the beginning here, here’s our existing dashboard. This is the state of the world today. We have a historical analysis of all the money that we’ve lost, our turn rate and things like that. And this is where we want to go from and to be able to build a predictive model and try to understand which customers will churn. So in order to do that, right now I’m in SQL Analytics lens of Databricks and I’m going to flip over to the machine learning side of the product and I’ve already started working on some notebooks for this.
So this is what we’re going to be walking through here. Now in this case of course, we need to do ETL, that’s a prerequisite, right? We also need to make sure that we do exploratory data analysis. In this demo I’m not going to be going through that, if you want to get a deeper dive in that part of the product and the capabilities then I recommend you go check out Sean Owen and Austin Ford’s talk on exploratory data analysis and data science on Databricks, that’ll be a great way for you to learn more there. For these purposes, we’re going to assume that those things are already done. So we know what data set we’re going to use and we’re going to just move straight to some feature engineering and pick up the thread from there.
Okay, so here we are. I'm in my data scientist role, right? We're not going to be doing any ETL or any EDA like I just said, but we are going to start doing some feature engineering, and then we're going to wind up persisting our results to the feature store. So let me make this just a little bit bigger. Okay. So to get started, I'm going to read in a Delta table that is in the Lakehouse, that's in our lake, okay? You can find this over here in the data tab. So I'm going to read this into Spark and take a look at it [inaudible]. Then, while that's running, I'm going to take a quick look over here and talk about what we've done to plan for scale and to have a robust data pipeline.
So I’m using the feature store capability that has been recently announced for Databricks. And I’m going to define a function that does my feature engineering for me. And in this function I’m using Koalas. The background here is actually that my teammate gave me some code that was written in pandas and Python, and it was very, very effective but it was limited to only running on a single note. So by using Koalas, I’m able to just take his exact same syntax and get back to running Spark because Koalas is a Panda’s API on Spark, so it’s very, very easy for me to do that.
Now that this has run, let's take a look at what some of that data is. Okay, great, so we have a unique customer ID and we have a bunch of demographic data, as well as, at the end here, a churn column that tells us whether or not the customer actually churned. Okay, so this is a very nice data set, it's fairly clean. The one thing that we really need to do, though, is one-hot encode these categorical columns, right? We need to create dummy variables for all the different levels of these categorical columns. So, like I mentioned before, we're going to use Koalas to do that, and using this syntax will allow the result to be persisted to the feature store.
So let’s define our function here, okay. Now, using the feature store, we’ll compute those features on our Spark data frame that we read in from Delta Lakes. And then we’ll register this table in the feature store using the creative feature table function. Okay, great, we’ve done that. And one thing that’s worth noting is the description that I’m leaving here. This allows other people to basically understand what it is they’re looking at when they find this in the feature store. This demo is not going to be overly heavy on feature store, but I did want to show how we can start off by saving these features and being able to reference them later.
Now, another thing that we could do if you don’t have the feature store or you don’t want to necessarily use that, you can always save this data to Delta Lake, just like you’ve done before. And I have a little example of that here. Okay, great, so now we have features we’re ready to do some training. Now we could just open up another notebook and we could start running code in there and loading that data and building some models, but it is very, very efficient and I think a tremendous boost to our productivity to use the AutoML that is now built into Databricks. So let’s go over here to our AutoML. This is going to create a new experiment in MLflow and we’ll select the cluster that we want to run this on. We’re going to be doing a classification problem of whether or not someone’s going to churn.
We’ve just set up our training data, so let’s come over here and get the features, okay, this looks correct. The column that we’re going to be predicting on is churn and the experiment name is going to be one for data experiment, okay. Now just a quick note on the advanced configuration, we’re going to be evaluating the F1 score, the default timeout is 60 and the maximum of trial’s 200. We can go all the way down to five minutes or we can go all the way up to some minutes larger than that. With the click of a button, let’s get this going and let’s see what happens.
This should look familiar to those of you who have used the MLflow tracking server before. We're sitting here inside the tracking server and we're waiting for runs to come in. So what's going on right now with AutoML is that it's taking the data that we gave it and using Spark and Hyperopt to parallelize the search for the best model. Now Hyperopt, if you're not familiar, is an open source library that uses Bayesian optimization to try and converge on the optimal hyperparameters for your model. And what AutoML is going to do is parallelize that search using Spark, and it'll do so across a number of different models, as you'll see. So we'll be exploring XGBoost, logistic regression and random forest. Now as these runs come in, we'll start to be able to see the results in the UI here.
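AutoML handles this search for you, but if it helps to see the shape of it, here's a rough, hedged sketch of the same pattern written by hand with Hyperopt and SparkTrials. The objective function, the logistic regression search space, and the `X_train`/`y_train` variables are all assumptions for illustration, not what AutoML runs internally:

```python
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def objective(params):
    # Train a candidate model with the sampled hyperparameters and return a
    # loss (negative F1) for Hyperopt to minimize. X_train/y_train are assumed
    # to be pandas objects prepared from the feature table.
    model = LogisticRegression(C=params["C"], max_iter=1000)
    f1 = cross_val_score(model, X_train, y_train, scoring="f1", cv=3).mean()
    return {"loss": -f1, "status": STATUS_OK}

search_space = {"C": hp.loguniform("C", -4, 2)}

best_params = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,                    # Bayesian-style sequential search
    max_evals=50,
    trials=SparkTrials(parallelism=8),   # fan trials out across the Spark cluster
)
```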
All right. So the first thing that you’ll notice actually is that we have a data exploration notebook that’s available to us. So this isn’t to say that you shouldn’t do your due diligence and do exploratory data analysis but the way that Databricks has built AutoML is with a glass box approach. And the idea of a glass box approach is that you’ll be able to see exactly what’s going on. We generate all the code, we generate all the notebooks for you and then we give them to you so that you can open them up and edit them. This is in contrast to something like a black box approach, where you push a button and you get a model out of it. It could be a really good model, but you have no way to go in and actually inspect it and edit it and things like that.
So in the data exploration notebook, we have some summaries of the data and visualizations and things like that. It's really handy to look at the label distribution and see whether you have imbalanced classes or not, very, very cool stuff. But we're not going to spend too much time on this, I just wanted to highlight the productivity boost there. Now that we have some runs coming into our experiment, we can actually take a look at what we have so far. So like I mentioned, we have these different models being trained with different parameters, and we can see the metrics coming in that show how they're performing. So this is going to keep going until we either converge upon the best trial or we run out of time. So let's take a look at the best notebook that we have right now.
Okay, let’s take a look at this one here. So if we drill into the particular run inside the MLflow tracking server, we can see all of the parameters that were long, this is using MLflow auto logging. So grabbing all of these parameters associated with this particular model training, all of the metrics in terms of its performance. And it’s been tagged with the estimator class, the estimator name, and very importantly, there’s some additional kind of documentation in things that are going on over here. AutoML on databases automatically logging the confusion matrix, the precision recall curve and the ROC curve. This is all being just tracked for you, you don’t have to do anything. And of course, the model artifact itself is saved here and there’s even some handy information to help you get started with putting this into the registry, which we’ll talk about more in a minute.
Another important thing to call out here is the concept of the model schema. The model schema, or the model signature, is essentially the definition of what the model expects in terms of inputs and outputs. Now, if you have a model running in production, let's say, and you want to update it with a new version but your schemas don't match, then you're going to break something. So having this captured at the time that the model is logged is a safeguard, and it allows you to have confidence that your pipeline will be more stable.
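For context, here's a hedged sketch of how a signature typically gets attached when a model is logged with MLflow. The `model` and `X_sample` variables are assumptions standing in for a fitted scikit-learn model and a small pandas sample of the training data:

```python
import mlflow
from mlflow.models.signature import infer_signature

# Infer the input/output schema from a data sample and the model's predictions
# on it, then store that signature alongside the model artifact.
signature = infer_signature(X_sample, model.predict(X_sample))

with mlflow.start_run():
    mlflow.sklearn.log_model(model, artifact_path="model", signature=signature)
```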
The other important thing to note here is that the dependencies and the system environment for the particular model are also logged along with the model. These are things that have been in MLflow for a while, but as long as we're here it's important to mention them. Okay, great, let's go back to our AutoML experiment and see where we're at. Okay, we have 73 runs and this is still going. I've already gone ahead and taken the best run, so let's go ahead and look at that one. Okay, now we're drilling into a particular notebook that was generated from one of those runs in the tracking server.
This auto-generated notebook is exactly what we mean by glass box AutoML, right? We can come in here, we can edit this and we can decide what we want to do with it. Now, the one thing that I want to highlight here, actually, let me just show this quickly. So with autologging, right, we set up autologging here, and then you can see the same familiar MLflow syntax that you would normally use if you've been using MLflow. So there's nothing hidden, this is totally transparent. We're telling you exactly what's going on, we're just offering it to you and making it easier.
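In other words, the generated notebook boils down to something like this hedged sketch, where `model`, `X_train`, and `y_train` are placeholders for whatever estimator and data the trial uses:

```python
import mlflow

# Enable autologging so parameters, metrics, and the fitted model are
# captured automatically for every run.
mlflow.autolog()

with mlflow.start_run(run_name="baseline_churn_model") as run:
    model.fit(X_train, y_train)   # params and metrics are logged automatically
    print("Logged run:", run.info.run_id)
```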
Now, one of the things that's pretty cool is that we actually use SHAP as part of these auto-generated notebooks, to help you understand the feature importance of the variables that are in your training data. Now, just to note here, by default we'll only use one sample. I've actually already come in here and edited this. So just for the sake of expediency, the setting here is to use one sample to generate the SHAP values, but you can come in here and increase it, that'll just increase the runtime. But yeah, you'll be able to see that, obviously, tenure is the most impactful variable in our dataset in terms of predicting who's going to churn and who's not.
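As a quick aside on that step, here's a hedged sketch of how you might bump up the sample size for the SHAP calculation. It assumes a fitted model exposing `predict` and a pandas training frame `X_train`, which are placeholders rather than the exact variables in the generated notebook:

```python
import shap

# Use a larger sample than the default single row so the SHAP summary is more
# representative; bigger samples simply take longer to compute.
background = X_train.sample(n=100, random_state=42)
to_explain = X_train.sample(n=100, random_state=7)

explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(to_explain)

# Mean absolute SHAP value per feature; in this dataset, tenure ranks highest.
shap.summary_plot(shap_values, to_explain)
```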
Okay, great, so now we have our best model, right? We have our feature engineering pipeline done, we have our best model from AutoML, and now we're going to go and start to deploy it. So before we can push this model to the registry, the ML engineer is going to come in and they're going to set up some webhooks. Now, if you're not familiar with webhooks, they're essentially HTTP requests that are triggered when an event happens, and they're super flexible, super useful. So I have a diagram that I'm going to show you of what goes on when these webhooks fire, but it's important to understand that with MLflow and the model registry, you can have a webhook fire when a model is put into the model registry, when you create a model version, when you request that a model is transitioned to a different stage in its lifecycle, and so on.
Now there’s two primary types of webhooks that we’re going to be using. We’re going to be using a webhook to send notifications to slack and then we’re also going to be using one to trigger a Databricks job. So how is that going to work? So let’s take a look at this diagram here. So there’s two notebooks here. We have a promote best run to registry notebook and the task in this notebook is to just annotate the model and then request transitions to staging, okay? This will also… Even if it does not successfully push them out to staging it we’ll start by putting the model in non-stage in the model registered. Then we have a testing notebook, that we’re going to go through it a little more detail. And what this does is it checks the presence of the model schema or a signature that’s being put into staging, checks to make sure that the accuracy of the model is valid across different, different demographics.
It also checks to see if there's any documentation or artifacts present with the model. Depending upon the results of those tests, we'll set different tags that will live with the model in the registry throughout its lifecycle, and then we'll approve or reject the transition. Now where webhooks come in is that as soon as the data scientist submits the request to transition the model to Staging, that webhook is going to fire and it's going to send a notification to Slack to let the team know that, "Hey, somebody is requesting this transition." And then it's also going to submit a request to the Databricks Jobs API, which will in turn kick off the testing job against that model.
The results of the testing job will trigger another webhook that will let the team know in Slack, "Hey, these are the results of the test," whether they were successful or not. And if the tests fail, this testing notebook will reject the request to transition and move the model to Archived. And if they're successful, then it will accept the request and move the model into Staging.
Okay. So let’s take a look at that. This is how you create a webhook. I’m not going to go into too, too much detail, but we’re going to get the model from the widget up here and then we’re going to create some Jason that we will post to the Databricks REST API for MLflow, the MLflow REST API. So we’re going to give… In our request, we’re going to specify the model name, the type of event that we want the web to fire on. We’ll give it a description, but in this case we’re going to set this to active because we actually want this to be used. There is a testing mode that we could use. And then of course, we have to say what is the job ID that we’re going to run? And then where is the job going to run? And then that’s it. We just make that post request and pass that body, that jets on to the end point and then the webhook is created.
Similarly, when we set up the Slack notification, we do it the same way. The only difference is that we have a webhook URL that we get from Slack, and we include that in our request to create the webhook. Okay, wonderful. Now, the other thing I wanted to call out here is that there are endpoints that list the webhooks you have, so you can see all the ones you're working with. I have three of them right now, and we can also delete particular ones if we want to.
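Continuing the same sketch, the Slack variant just swaps the job spec for an HTTP URL spec, and listing uses the list endpoint; the Slack URL here is a placeholder:

```python
# Slack notification webhook: same create call, but with an HTTP URL spec
# pointing at a Slack incoming-webhook URL instead of a job spec.
slack_body = {
    "model_name": "telco_churn_model",
    "events": ["TRANSITION_REQUEST_CREATED", "MODEL_VERSION_TRANSITIONED_STAGE"],
    "description": "Notify the team channel about registry activity",
    "status": "ACTIVE",
    "http_url_spec": {"url": "https://hooks.slack.com/services/<placeholder>"},
}
requests.post(f"{host}/api/2.0/mlflow/registry-webhooks/create",
              headers={"Authorization": f"Bearer {token}"},
              data=json.dumps(slack_body))

# List the webhooks registered for this model; a similar delete endpoint
# removes one by its webhook ID.
listed = requests.get(f"{host}/api/2.0/mlflow/registry-webhooks/list",
                      headers={"Authorization": f"Bearer {token}"},
                      params={"model_name": "telco_churn_model"})
print(listed.json())
```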
All right, the ML engineer's job of setting up the webhooks is done, so let's come over and start actually moving this model to the registry. So now we're moving along, we're getting closer to the end here. We can promote our best model to the registry. What we're going to do in this notebook is annotate the model that we want to submit to the registry, and then we're going to make the request to transition it.
So the value of the registry is really around being able to discover and see all the models that are running in production and staging in the various stages of their life cycle. You want to be able to lift things out of the tracking server with all of those different trials and things like that, and put it into a central place where, “Hey, this is the stuff that we’re actually using right now. And these are the models that we’re going to be updating going forward.”
Okay, so the first step here is to actually take the model from the run ID and promote that to the registry. And in this case, what I'm going to do is get the run ID and then set some tags. I want to tag which table this model was trained on, and I want to note what the demographic variables are. This is something that, as a team, we decided we want to do, because our tests are going to check whether or not there's skew in performance across those demographic variables.
Okay, so we can run this and it will add a new version to the model registry for this particular model. Okay, now we're going to update the description, all right? In terms of documentation and best practices and making sure that people know what is going on, you can technically push models to the registry without adding these descriptions, but I think best practice is to do this, and in our tests we're going to reject any model that is submitted to the registry without a description. Okay, so this is what we're doing right now. We're going to submit this request and then it's going to fire off these two different events.
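Put together, the promotion step might look something like this hedged sketch with the MLflow client. The model name, tag values, and description are placeholders, `best_run_id` stands in for the AutoML run we picked, and the transition-request call reuses the `requests`/`json` imports and `host`/`token` variables from the webhook sketch:

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
model_name = "telco_churn_model"                  # placeholder registry name
run_id = best_run_id                              # assumed: the best AutoML run's ID

# Register the model artifact from that run as a new version in the registry.
version = mlflow.register_model(f"runs:/{run_id}/model", model_name)

# Tag the version with the training table and the demographic columns the tests check.
client.set_model_version_tag(model_name, version.version,
                             "training_table", "churn_demo.customer_features")
client.set_model_version_tag(model_name, version.version,
                             "demographic_vars", "SeniorCitizen,gender")

# Add the description our validation job looks for.
client.update_model_version(
    name=model_name,
    version=version.version,
    description="Churn classifier selected by AutoML (placeholder description).",
)

# Request the transition to Staging; this is the event the webhooks listen for.
requests.post(f"{host}/api/2.0/mlflow/transition-requests/create",
              headers={"Authorization": f"Bearer {token}"},
              data=json.dumps({"name": model_name, "version": version.version,
                               "stage": "Staging",
                               "comment": "Please review the new churn model"}))
```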
All right, so let’s make that request and let’s leave a comment there too. Now we should be able to see if I bring up slack here, we should be able to see that at this moment, we had a request to go transition to staging and we should also be able to see a job running and date of births. So let’s come over to the jobs and there we go, we have our validation job running. So, this has been running few other times today, this has just kicked off. Let’s see if the notebook is already available, yes. Okay, great, so here’s our testing, but can we can actually just look at this here in the jobs UI, we don’t even need to go open up the other notebook.
So this is the ML engineer's job, right? They wrote this notebook, they're working on this kind of thing. The first thing is to get the feature store, and then we're going to receive the payload from the webhook. This little bit of Python over here is just getting the payload and making sure that we know what the model name is and what the version is; okay, we're working with version 15 of this model, great. Now let's make sure that this model can actually make predictions. So we're going to do that here: if the model can predict, then we will tag this model in the registry as being able to make predictions, and vice versa if it cannot. We will also do a signature check to make sure that the schema has been included with the model.
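A hedged sketch of the start of that validation notebook might look like the following. The payload parameter name, the sample scoring frame, and the tag names are assumptions for illustration:

```python
import json
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Assume the webhook delivers its event payload to the job as a notebook
# parameter named "event_message".
payload = json.loads(dbutils.widgets.get("event_message"))
model_name = payload["model_name"]
version = payload["version"]

# Load the exact version under test and make sure it can score a sample batch.
model_uri = f"models:/{model_name}/{version}"
model = mlflow.pyfunc.load_model(model_uri)

try:
    model.predict(sample_features_pdf)   # hypothetical pandas sample of the feature table
    client.set_model_version_tag(model_name, version, "predicts", "true")
except Exception:
    client.set_model_version_tag(model_name, version, "predicts", "false")

# Signature check: the logged model should carry an input/output schema.
has_signature = model.metadata.signature is not None
client.set_model_version_tag(model_name, version, "has_signature", str(has_signature))
```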
We’ll also take a look at the accuracy, like I mentioned, across different demographics. So if the accuracy for any particular demographic that we submitted is less than 55%, then we’re going to fail the test. What’s interesting in this data set is actually that senior citizens, they have poor performance in terms of churn. So the model predicts fairly well, whether you’re male or female, but with, if you’re a senior or not, then that has a pretty big impact. So that’s something to note. Very, very close to being rejected there.
Here’s our check to make sure that a description has been included, here’s our check to make sure that the artifacts were stored alongside the original model run in the tracking server. So we can always go back and look it back, very important for reproducibility and for, and for general documentation purposes and auto audit ability. And then here’s the results of our test so, passed with flying colors. And then the last thing we’re going to do is we’re going to send the slack message that you saw pop up over here to give us the results of the test so that’s wonderful. And then, because this model passed the test, we will approve the transition to staging, and we’re good to go. Now that was a whole bunch of stuff that we did in the model registry, so let’s go take a look at all the things that happened there before we move on.
So I’m going to come over here to the model registry, I’m going to search for my model. There’s our version 15, there’s our description of the model. Let’s see, what’s going on with version 15. It’s a little bigger now, so here’s all the tags from our tests. There’s the input schema and then here’s the results of our webhooks, right? So we made the request, we included a comment at that time and then we approved it after the tests ran. So brilliant, we’re automating things, this is great. So we have a model and staging, it’s ready to do inference.
Let’s go over to our inference notebook. And this is very, very straightforward. What we’re going to do is we just load in the model using a Spark User Defined Function, so we know that this is important here, where we’re going to load in the model as a Spark PDFs, okay? And the model URI, we no longer have to deal with random strings and things like that, the path to the staging… So the model that’s in staging is stable, so if we update the version of the model and the registry that is marked as staging, then this is will always refer to the latest version of that.
So here we go, we have our model loaded into our [inaudible]. We're going to load in the data we have for customer churn and then we will predict on it. So we're taking a model that was trained in Python, not using Spark, and with just a few lines of code we're able to predict on a much, much larger data set. So I think that this speaks really, really well to the ability to have a robust data pipeline and to have strong scalability going forward, right? Whether you're working with Spark machine learning models in the Spark ML API, or you're working with Python itself.
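For reference, a hedged sketch of that scoring pattern, with placeholder model and table names and hypothetical column handling:

```python
import mlflow.pyfunc
from pyspark.sql import functions as F

# "Staging" in the URI always resolves to whichever version currently holds
# that stage, so this notebook never changes when a new version is promoted.
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/telco_churn_model/Staging")

customers_df = spark.table("churn_demo.customer_features")
feature_cols = [c for c in customers_df.columns if c not in ("customerID", "churn")]

scored_df = customers_df.withColumn(
    "churn_prediction",
    predict_udf(*[F.col(c) for c in feature_cols]),
)
display(scored_df)
```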
Okay. And then, just so that we can also make this available for the dashboard, we'll write these predictions to Delta too. Okay, now we are at the stage where the data scientist can come back in and update the dashboard. So this is what that looks like. We come back in here and we've updated it, and now that we have these predictions, we can actually go in and say, "Hey, these are the customers that we think are going to churn. This is the amount of money that's at risk of churn," and things like that. This is a much more insightful dashboard, and it allows business users to actually take action and say, maybe we should give a discount to this particular customer, or maybe we need to go talk to them and see why it is that they're upset, or whatever it is.
So this is the outcome of that workflow, right? That long multi-step workflow that we went through. But there's still one more thing to do. The models are not going to update themselves, right? They're trained on a snapshot of historical facts, and the world is constantly changing; the customers that churn today are not going to be the customers that churn tomorrow. So we need to be able to go and update this on a regular basis.
So we have one more notebook that we're going to look at, okay, right over here. This is our monthly retrain job, the final step in the process. And what this does is very, very straightforward, actually. We just load in the features that we have that are up to date, and then instead of using the AutoML UI, we use the client library. So we import Databricks AutoML, run our classification the same as before, give it a timeout of five minutes, and then we get a lot of that same information, right? We get the data exploration notebook, we get the best trial notebook, and so on and so forth.
Now, I don't want to have to go hand-code which run was the best run or anything like that. From the output of AutoML, I can actually access the run ID and a lot of other information. So that's what we're doing here: we look at the best trial, get its run ID, and then we're able to tag that run like we did before with our first baseline model, tag it, and push it to the registry. And then this is all the same workflow that we went through the first time around, right? We make sure we add our comments, and we make sure that we submit the request to move it to Staging with a comment associated with it.
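A hedged sketch of that programmatic retrain, using the `databricks.automl` classification API described here, with placeholder table and model names:

```python
import mlflow
from databricks import automl
from mlflow.tracking import MlflowClient

features_df = spark.table("churn_demo.customer_features")   # placeholder feature table

# Kick off the same classification search as the UI, capped at five minutes.
summary = automl.classify(
    dataset=features_df,
    target_col="churn",
    primary_metric="f1",
    timeout_minutes=5,
)

# Grab the best trial's run ID so we can tag it and register it, just like the
# baseline model.
best_run_id = summary.best_trial.mlflow_run_id
client = MlflowClient()
client.set_tag(best_run_id, "training_table", "churn_demo.customer_features")

new_version = mlflow.register_model(f"runs:/{best_run_id}/model", "telco_churn_model")
```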
Okay, now the last thing we need to do, now that we have this retrain job, is schedule it to run on a monthly basis, okay? So this is scheduled to run on the first of the month at midnight, and this job is a little different than the previous job, it's actually going to do three things. It will run our feature engineering notebook to make sure that the features are up to date, and I've set this to retry a maximum of three times. Then it will run the AutoML retrain notebook that we just looked at, and the output of that feeds the validation job, so it will run the validation job after that. And that's pretty much it. We can take a look at a particular run here, and we can see that this pipeline ran successfully, updated the model that was in Staging, archived the one that was there before, and we're good to go, we have our whole pipeline up and running.
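If you were to define that multi-task job through the Jobs API rather than the UI, the payload might look roughly like this sketch. The notebook paths, timezone, API version, and job name are assumptions, cluster settings are omitted for brevity, and it reuses the `requests`/`json` imports and `host`/`token` from earlier:

```python
job_payload = {
    "name": "monthly-churn-retrain",
    "schedule": {
        # Quartz cron: midnight on the first day of every month.
        "quartz_cron_expression": "0 0 0 1 * ?",
        "timezone_id": "America/New_York",
    },
    "tasks": [
        {"task_key": "feature_engineering",
         "notebook_task": {"notebook_path": "/Repos/churn/01_feature_engineering"},
         "max_retries": 3},
        {"task_key": "automl_retrain",
         "depends_on": [{"task_key": "feature_engineering"}],
         "notebook_task": {"notebook_path": "/Repos/churn/02_automl_retrain"}},
        {"task_key": "validate_model",
         "depends_on": [{"task_key": "automl_retrain"}],
         "notebook_task": {"notebook_path": "/Repos/churn/03_validation"}},
    ],
}

# Create the job via the multi-task Jobs API (cluster specs left out here).
requests.post(f"{host}/api/2.1/jobs/create",
              headers={"Authorization": f"Bearer {token}"},
              data=json.dumps(job_payload))
```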
So, that was quite an exhaustive, end-to-end walkthrough of how you would go from feature engineering all the way through to model deployment, including building a baseline model, setting up automated tests, validating that the model is indeed effective, updating our dashboard, satisfying our business stakeholders, and perhaps being heroes for a day.
So another way of looking at this, if we drop the pure abstraction and look at it more from a Databricks features and product focus, is that what we really did to walk through the full ML lifecycle was have these different personas all working on the same platform, going from data prep and featurization through model development. In this case we used AutoML because it's awesome and easy to use, it gave us a great baseline model, and it really leveraged the full richness of MLflow's tracking capabilities. Then we took the best model and put it into the model registry. We had some tests that ran over here just to make sure that it met the standards of the team, of the company, and of the regulatory environment that we're in. And once we had that, we were able to move to actually deploying the model and running it.
Now, in this case, we didn’t get to production but the reason for that is that it was the first time that we were updating a dashboard so, we were still working with a staging kind of environment, that was the first time that we put that dashboard together and we would want to go actually talk to the business stakeholders and validate that this is what they want and iterate on that until we get to something that we can feel good about calling production.
So if you want to learn more about this kind of thing, and you want to hear more about this general approach to MLOps and ML engineering, the field team here at Databricks is going to be releasing a bunch of blogs and a bunch of content on this this year. So you'll see coming out on the Databricks blog soon The Need for Data-centric ML Platforms, and then Selecting Technologies and Platforms for Data Science and Machine Learning. Sometime after that, there'll be more on model and data monitoring on Databricks. We didn't talk too much about monitoring in this talk, but that is coming, and there'll be more. And even just a cursory look at all the talks available at Data + AI Summit, inside and outside of Databricks, shows there's just a ton on ML, so there's plenty for you to learn there too. And thank you very much for sticking around.

Rafi Kurlansik

Rafi is a Sr. Solutions Architect at Databricks where he specializes in enabling customers to scale their R workloads with Spark. He is also the primary author of the R User Guide to Databricks and th...