Software engineering evolved around certain best practices such as versioning code, dependency management, feature branches, etc. However, the same best practices have not translated to data science. Data scientists who update a stage of their ML pipeline need to understand the cascading effects of their change so that their downstream dependencies do not end up with stale data, or unnecessarily rerunning the entire pipeline end-to-end. When data scientists collaborate, they should be able to use the intermediate results from their colleagues instead of computing everything from scratch.
This presentation shows how to treat data like code through the concept of Data-Driven Software (DDS). This concept, implemented as a lightweight and easy-to-use Python package, solves all the issues mentioned above for single-user and collaborative data pipelines, and it fully integrates with a lakehouse architecture such as Databricks. In effect, it allows data engineers and data scientists to go YOLO: you only load your data once, and you never recalculate existing pieces.
Through live demonstrations leveraging DDS, you will see how data science teams can:
Tim Hunter: Good morning, good afternoon and good evening, wherever you are. It is our pleasure to welcome you to this session about the future of data science and its integration with computer science and computer software. In this talk, Brooke and I will discuss how you can integrate data science, and all the code and software that you wrote for data science, with the traditional code that you deploy in the rest of your applications. And this approach, which we call data-driven software, is essentially at the core of what people talk about when they talk about integrating AI with business applications.
Because typically, when people talk about AI, what they really mean is integrating data with their existing logic. As we’ll show you here, this comes with some new challenges. But now is also the time to come up with new solutions. And we hope to convince you in this presentation that data-driven software could be a way for you to address these problems. So, let me quickly introduce my co-host and partner, Brooke Wenig. Brooke has been working at Databricks for a few years and she has taken a class in machine learning from Databricks.
Her name is probably familiar to you. She graduated from UCLA with a master’s degree in computer science, and you have probably also seen her as the co-host of Data Brew, a series of talks hosted by Databricks. So when she’s not caring for her lovely new baby, you’ll probably see her in California on her bicycle exploring some new routes.
Brooke Wenig: Thanks, Tim. And now I have the pleasure to introduce Tim Hunter, who is a senior AI specialist at ABN AMRO. Prior to joining ABN AMRO, Tim and I were co-workers for many years at Databricks. He’s also the co-creator of many open source packages, including GraphFrames, Koalas, Deep Learning Pipelines and now data-driven software. He holds a PhD in computer science from UC Berkeley. And when he’s not creating all these open source packages, you can see him sailing around the world and residing in Amsterdam these days.
So now I’d like to discuss the outline for the talk today. We’re going to talk about the challenges of integrating machine learning and data, and how DDS helps, and in particular how you can go YOLO, or only load your data once, with data-driven software. We’re then going to give you two demos, discuss the roadmap and answer any questions you have at the end. And as this is a recorded presentation, please ask questions in chat throughout, we can still respond even while we appear to be actively speaking.
So now I’d like to discuss the cycle of machine learning. Typically it’s going to start with some business problem that you’re trying to solve. So you’re going to go through this process of data understanding, preparing your data, data cleansing, feature engineering, building models, evaluating them, if they perform well, deploy them, and then continue to repeat this iterative process. However, you don’t just have one model, you typically have many models, and there are some downstream dependencies.
So the output of this model is now the input to this model, et cetera. And every step along the way, there’s going to be associated code and data artifacts. And we need some way to be able to version these and track these so we can understand, if we change this step here, what impact it has on any of the downstream models or downstream data processing steps. However, there are some issues with how we approach traditional data science these days. So you might have faced the issue of teams operating in silos: it could be that you have decentralized data science organizations that don’t speak with each other.
Or you have one team that’s responsible for the feature engineering, another team that’s doing the data science, and the cascading changes don’t always propagate very easily. Furthermore, there’s often large overhead in coordination. So for example, if one team wants to touch some portion of a pipeline that impacts all of these other downstream teams, it will typically lead to a drop in productivity and lower confidence in the result. I can’t tell you the number of times where I’ve been in a meeting and somebody says, “Are you sure you used the right table? Did you grab it as of April 1st?” Or whatever the date may be, et cetera.
Additionally, there are often increased costs. Data scientists like to operate under “trust but verify”: let’s just rerun the entire pipeline, just to be sure that these results are correct and reproducible. And so you can see that this takes a lot of time, and it can actually be quite costly from the computation side as well. There are some solutions to help you version your data and version your code, but they’re often specialized in some fragment of the pipeline. So for example, one tool works just on the model monitoring side, and then you have a different tool for data quality.
Or they’re specialized in some specific technology: for example, they just work with the scikit-learn stack, or just work with SQL, et cetera. And so now I’d like to hand it over to Tim to talk about how data-driven software can address many of the challenges that I just laid out on the previous slides.
Tim Hunter: Thank you, Brooke. So data-driven software helps you combine code with data, as Brooke mentioned, and I’m going to describe here a couple of the consequences of being able to seamlessly combine these two pieces together. The most immediate feature, I would say, is that it allows you to treat data as if it was code. In particular, when you check out some code, it allows you to branch it, to version it, to merge it, and essentially to use all the features and all the methodologies that you have already experienced when working with computer code.
Another feature that comes from such a methodology concerns the dependencies between your data sets. Because they’re expressed in the code, and because they’re understood by data-driven software, you are able to explore them, process them and visualize them even before you run any code. In particular, when you have to analyze the impact of the changes that you are making in your code, you can see how your data pipeline is being changed, and which pieces of this data pipeline, be it tables or features, are being updated.
As Brooke mentioned, one of the challenges of working with data is that we are usually unsure whether we’re working with the freshest, latest version or whether we’re dealing with stale data. Because data-driven software analyzes your code and integrates with the rest of your data pipeline seamlessly, it is able to cache previous results and serve back to you results that it knows have not changed, which allows you to iterate very quickly on these outputs. Another aspect, which I believe also makes quite a difference in the way that we deal with code and data, is that when you think about modern data, about what it means to work with data, we usually do not mean just data sets.
We do not just mean tables stored inside a database. Instead, data now has a much more general meaning: data sets, but also models, statistics that we derive from the models, graphs, plots, any sort of artifact that we generate from the raw data. And data-driven software allows you to combine all these pieces into one namespace and treat them as if they were part of the same package. So this is why, as Brooke said to me once, when you use something like data-driven software, you can go YOLO: you can simply load your data once and never have to think again about where it is coming from or how it was generated.
Because behind the scenes, it will all be treated like code. So in the rest of this presentation, we’re going to present one implementation of these principles. We’re going to show you an implementation in Python, which is the most popular and most prevalent language for data science these days. But all the principles that we discuss here are applicable to any other programming language that combines data with code. So, if we move beyond this small example, what can it give us in larger pipelines?
So, let’s look at a real example of what it means in practice to use the principles of data-driven software. You can have large pipelines which combine graphs, models and features, and all these pieces are interlinked with each other. This graph here, which is generated directly by data-driven software, shows you all the complex, interlinked dependencies between the multiple pieces of data that are being produced. All of them here happen to be graphs, tables or features. In this case, one of the collaborators who was working on this pipeline made a change in the way a feature is calculated.
And before he even runs or merges his code, or shares his results with the rest of his teammates, he is able to understand what the impact of his change is going to be. In this case, his change to feature4 will re-trigger the creation of two other features, and also the assembling of all the features that are fed into a machine learning model. Furthermore, because data-driven software has an understanding of how each piece of data is generated, be it small or large, it is able to rerun and recreate just the pieces that are strictly necessary for handling this code change.
In particular here, it will take just minutes to recreate what was needed. So how much does it help in performance? Well, Brooke mentioned that usually we like to rerun everything. And before DDS, I too was sometimes re-triggering large pipelines “just to make sure”, as we like to say. Well, actually, as we’re forced to say. In the case of the pipeline that I just mentioned, it takes about 10 hours, and multiple terabytes of input to process, to generate the output. But why would we need to do that if nothing has changed?
If the raw data has not changed, and if the code that generates all the steps and all these intermediate features has not changed, why would we need to do that? DDS is able to look, on the fly, at the code and at the data sets that it depends on, and to see that nothing has to be rerun. Doing all this checking for this particular pipeline takes a matter of seconds. So you can imagine all the gains that you can have when you can simply check out some code, run it, and it takes just a few seconds, because all it needs to do is check that the tables that you have generated are the same as the ones that you had before.
We’ve discussed why data-driven software matters, what kind of performance you can expect from it, and how it can help you. Now I’d like to make it more concrete and show you in practice what it feels like, and how many changes you can expect in the way you write software for data science. And actually, I hope to convince you here that there are not so many changes that you need to make: it is usually a matter of annotating a few key places inside some existing code, and following some best practices when you write it. So, Brooke and I are going to discuss a standard pipeline, something that you can see in pretty much any textbook about how to use machine learning.
We’re going to load a data set. Here it will be a data set about wine. We’re going to build some features from this data set, [inaudible] the model, evaluate the model and get some scores out of that. When we want to do that, we usually have two pieces. We have the code, which expresses all the transforms we need to do, the model that we create, and the scores we generate. And next to that, we would usually like to store some of these outcomes. For example, we would like to save the model somewhere, so that another system such as MLflow can log it and then deploy it.
We would like to store the evaluation scores that we’re going to generate so that we can visualize them, or share them with other teams so that they can see what is coming their way. So, data-driven software allows you to reconcile these two pieces. If we start from the code that you would normally write, where you would put some print statements so that you can track what is going on, and where you would ask Python to write the model somewhere onto your computer, DDS will automate all these pieces. Here are the changes that you need to make to your code to combine where the results should be put with the usual control flow of your program.
So, instead of directly calling a function to train a model, you add an extra annotation to say that the output of training a model with these features and with this learning rate will be stored under a path called model. And once you have this model, the same thing goes for the evaluation: we would like to keep the outcome of evaluating the model and creating the scores in a path called, naturally, scores. We removed all the pieces that corresponded to storing or printing the results, and instead we ask the DDS package to do it for us.
So it sounds like a small change, and now you see what the code looks like in practice. So, how does it help us? So far, it seems that it only adds a few annotations. But what does it really give us in practice? Well, here in this code, we have a learning rate of 0.01. Let’s say that in our code, we decide to change it to 0.05. What will happen then? In this case, DDS will analyze your code, and it will see that because you changed how you are training your model, the output of this function, the variable model, is going to change.
But that also means that the model that you want to save, store and present to other people inside your storage system will be different. So, when you change this learning rate, and only in this case, DDS will retrain the model, store it, overwrite the version of the model that exists at the path, and return this new model. And because it is able to understand the flow of your code, it will see that this model is being used for computing the scores. As such, it will reevaluate the scores for this model, write them back into the storage system, and return the scores to you.
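The learning-rate scenario can be imitated with a toy cache. This is only a sketch under stated assumptions, not the DDS package: the `keep` helper below is a hypothetical stand-in that hashes a function's bytecode together with its pickled arguments, whereas the talk describes DDS analyzing the source code itself. The functions `train_model` and `evaluate` and the paths `/model` and `/scores` are illustrative names.

```python
import hashlib
import pickle

_store = {}   # path -> (signature, value); stands in for real storage
calls = []    # names of the functions that actually executed

def _signature(fn, args):
    """Hash the function's bytecode, constants and names with its arguments."""
    code = fn.__code__
    payload = (code.co_code, code.co_consts, code.co_names, pickle.dumps(args))
    return hashlib.sha256(repr(payload).encode()).hexdigest()

def keep(path, fn, *args):
    """Toy version: run fn(*args) only if its signature is not already at path."""
    sig = _signature(fn, args)
    cached = _store.get(path)
    if cached is not None and cached[0] == sig:
        return cached[1]
    calls.append(fn.__name__)
    value = fn(*args)
    _store[path] = (sig, value)
    return value

def train_model(features, learning_rate):
    # Stand-in for real training: the "model" is just a scaled sum.
    return learning_rate * sum(features)

def evaluate(model):
    # Stand-in for real evaluation.
    return {"score": round(model * 2, 6)}

def pipeline(learning_rate):
    features = [1.0, 2.0, 3.0]
    model = keep("/model", train_model, features, learning_rate)
    return keep("/scores", evaluate, model)

pipeline(0.01)   # first run: both steps execute
pipeline(0.01)   # nothing changed: both results come from the store
pipeline(0.05)   # new learning rate: the model and then the scores are refreshed
print(calls)
```

In this toy, the scores are refreshed because the new model value changes the arguments of `evaluate`; the real package is described as deriving the same cascade from the code alone, without running it.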
To finish this quick tour: the set of operations needed for DDS is actually fairly restricted, and it’s a very simple model in that sense. Data-driven software essentially revolves around three main functions, plus one extra that is specific to Python. The main function is keep. It takes a path and a function, and it reevaluates this function with its arguments, if necessary.
And it is asked to keep the result inside the path that is given. The path here can be an arbitrary path. It can be a path on your local machine, which is the default. But, as Brooke will show you in a demonstration, it could also be a path on a more complex file system like the Databricks File System (DBFS), which could be backed by S3 or by the Azure Data Lake Storage system: essentially any storage that can implement a key-value store. So keep is essentially a smart caching system. It only reruns this function and stores the outcome when it needs to.
To make it simpler within Python, because of the way Python works, keep is also available as a decorator, and Brooke will show you some examples of how to use it. With this decorator, you can write a function that generates any sort of data, be it models, data frames, Delta tables, anything that you can think of. You just add one quick annotation on top of it, and it will act as a smart caching system and also track the changes that happen when you want to regenerate this data.
Finally, the last big piece is load, which directly accesses the path of an artifact that you have already created and loads it. It doesn’t know how to create it; it simply knows how to access it. But it is more subtle and smarter than simply loading it, in the sense that it also tracks the dependencies and the changes that happen. So if you use this function inside some complex code, it will be able to see if something needs to be updated, and it will propagate these changes. And finally, there is also an evaluation function, which is a bit like keep, except that we do not keep the result.
So, how do these functions work in practice? I would like to give you a quick introduction to how data-driven software can be so good at figuring out what to run and what to update, and only when it needs to. So let me show you how DDS works before Brooke does a demonstration of it. On one side you have your code, which, as we said, creates a model and then takes this model, evaluates it and creates some scores. And every time you create an object and [inaudible] it with a path through data-driven software, DDS associates a unique signature with it.
And for this unique signature, it stores the result inside a data cache. This way, whenever it sees some code, it can very quickly calculate a signature for this piece of code and check whether this piece of code has run before by inspecting the contents of its cache. Then, if necessary, it reruns the code and stores the result in the data cache, or it simply returns the existing result to you. So this is how it is able to run only when it needs to, and no more than that. So now I’m going to give the screen to Brooke, who will show you what it looks like and give you a taste of how data-driven software works in practice.
Brooke Wenig: Thank you, Tim. All right, so now I’m going to go ahead and share my screen and walk you all through a demo of how to use DDS on Databricks. To start off, you simply need to install the DDS package from PyPI on your cluster to be able to run this notebook. I’m now going to go ahead and clear out any directories in case there is anything already written there. And to set it up to use DDS on Databricks, I’m going to set the store to be DBFS rather than some local path. Then I’m going to set the data directory to be data_managed_dds and the internal directory for any of the cached results to be data_cache_dds.
In his demo, Tim will also show you how you can link these results rather than copying over the data. So this is the simple hello-world example of using DDS. Here I have a function called data; it’s going to print “executing data” and return “hello, world”. The only thing I need to do to use DDS is to add this decorator, data_function, which says to store the results into hello_data. And so you’ll see that the first time I run this function, it’s going to say “executing data”, and then it’s going to write out some bytes to our data cache. If I go ahead and rerun it, you’ll notice that it doesn’t actually need to reevaluate the function; it’s able just to load the result from DBFS.
So now let’s take a look at where it’s stored here. I can see that the data is located here under data_managed_dds: I have this hello_data function, and I can see the output that this function returns. The data is also cached in data_cache_dds. And so let me show you the real difference between these two if I go back and make a modification. So if I go up here, and I say hello, world and I’m very excited about it, I add some exclamation points. The function has now changed because the underlying code has been updated, so now it has to re-execute this function, and it’s going to go ahead and write out these bytes.
And so if I go ahead and take a look at hello_data, it’s been updated here, and the data cache will now have two copies, the original version and the updated version. And this is super helpful, because if I go back and I say, actually, I’m not that excited, let me get rid of these, and I go back to the original function, it knows that it doesn’t need to re-execute it. Instead, it wrote out these bytes just to copy the output from the cached data over into the data_managed_dds directory that we specified. And so this way, if you’re a data scientist experimenting with something, and that approach didn’t work and you want to go back to your original approach, you don’t have to rerun the entire pipeline.
We’re able just to copy over the data that’s in the DDS cache into the managed data directory. And we can see that it’s updated here as well: we’re now back to “hello, world”. And anytime we call this function, you’ll notice it doesn’t have to trigger any calculations; the data is stored on DBFS, and it simply loads it from there. And so, that’s a super simple hello-world example. Now let’s look at a slightly more complicated example, where we have some dependencies across our functions. So here, we’re going to have a function f1 that simply prints out “evaluating f1” and returns 1. f2, on the other hand, depends on some outside variable, plus the output of f1. And f3 simply adds these two together, and you’ll notice that we’re adding this decorator to specify: please store the results of all of these computations for f1, f2 and f3.
So the first thing that we’re going to do is look at the displayed graph of all the dependencies. f1 doesn’t depend on anything, but f2 depends on f1, and f3 depends on both f2 and f1. So we can understand our dependencies. Let’s go ahead and execute f3 now. The first time that it runs this function, it needs to go through and evaluate everything and write it out to the file system. And so now I want to understand: if I update this outside variable (if we go back up, you can see f2 depends on some outside variable), what will that impact look like on the graph?
What has to be recomputed? And so I love this visualization, Tim showed a much more complex example of a real world pipeline. But you can see here, if we update some outside variable, we don’t need to rerun f1, we will need to rerun f2, which in turn, will force us to re-trigger a computation on f3. And so now if we go ahead and rerun that, you’ll notice it has to evaluate f3 and f2, but it doesn’t have to reevaluate f1. And so now anytime that we run this, we now just get the direct result, and you can see how wicked fast that computation is. So by using DDS, you can better understand your dependencies and what you’re trying to build, so you don’t need to recompute the entire pipeline just to make sure that everything was run end-to-end.
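The question "what has to be recomputed?" is a reachability problem on the dependency graph. The sketch below is an illustration only: the graph is written out by hand (DDS is described as deriving it from the code), and `outside_var` is a hypothetical name for the outside variable that f2 reads.

```python
from collections import deque

# Each node lists what it reads directly, mirroring the f1/f2/f3 demo.
dependencies = {
    "f1": [],
    "f2": ["f1", "outside_var"],
    "f3": ["f1", "f2"],
}

def stale_nodes(dependencies, changed):
    """Return every node that depends, directly or transitively, on a changed input."""
    # Invert the edges: input -> nodes that read it.
    readers = {}
    for node, inputs in dependencies.items():
        for inp in inputs:
            readers.setdefault(inp, []).append(node)

    stale = set()
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for reader in readers.get(current, []):
            if reader not in stale:
                stale.add(reader)
                queue.append(reader)
    return stale

to_rerun = sorted(stale_nodes(dependencies, ["outside_var"]))
print(to_rerun)   # ['f2', 'f3']
```

f1 does not appear in the result because nothing it reads has changed, which matches the demo: f2 and f3 are reevaluated while f1 is served from the cache.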
And with that data cache, we’re able to copy over the data, and we’re able to go back to a prior state of an experiment without needing to rerun everything. So that’s all I have for this very simple demo here. Now I want to turn it back over to Tim to talk about some of the implementation details of DDS.
Tim Hunter: Thank you, Brooke. So as you could see from Brooke’s demonstration, it all seems very smooth. And it may sound a bit magical: it runs only when you make changes, and it knows exactly that you made some changes, along with everything that depends on them. So how does it work? How does it know how to do that? The basis behind it is a technique called semantic hashing, in which, essentially, data-driven software has just enough understanding of the flow of your computer code to be able to calculate unique signatures for what you are doing. And if you have used a software system for versioning, like Git, this is effectively what you are using.
These are much the same ideas that you’re using in the background, in the sense that every change that you make is given a unique signature. So just to take a small example here, let’s say that we’re computing a piece of data called my_data, which depends on calling a function with [inaudible] and so on. For this piece, my_data, we can calculate a unique signature that corresponds to all the functions and all the elements it depends on in order to be calculated. In this case, the signature will be based on this function, training_animal, and its arguments, dogs.
But it will also be based on all the dependencies that this function itself has. So, this function depends on another function that would, for example, check the contents, and on the file name that would be needed for loading it, and so on. You’ll notice that it does not need to run anything: it simply inspects your computer code, sees what it contains, sees what it calls, and based on that, it can calculate a unique signature. Another thing that you may also notice is that, even though here we are going to use pandas to read a CSV file, the signature is not going to depend on how pandas loads a CSV file.
It is only going to see that you’re calling some external libraries, assume that they are part of the environment, and focus on the code that you have written. So this is how it is able to tell the difference between system libraries, meaning code that you depend on but whose details are not very relevant, and the pieces of code that you wrote, which contain the business logic that you really want to check. And this is how data-driven software allows you to have more complex programs, because for each of the functions that you call, and for each of the data sets that you manipulate, it is able to calculate a unique signature.
And this signature is what ends up determining whether you have already run something. This is why, with data-driven software, if your code has not changed and the input data has not changed, then the output will have the same signature, and hence the output is assumed not to have changed. So let’s take our initial example again, where we have a very simple machine learning pipeline. And let’s say that when we run it, we have this set of signatures here. Now suppose we make a change in the build feature function; for example, we decide to change the way we build our features.
Then it is going to calculate a new signature for the training data. And then, by cascading through the dependencies, the model and the scores will have different signatures. And the crucial part here is that calculating the signatures is extremely fast, because it only depends on the computer code; it does not depend on calculating anything on the data itself. So for a small piece of code like this, it is a matter of milliseconds to compute, which is much faster than going inside some data system and inspecting every row or every statistic to understand what it contains. It is also totally oblivious to the size of the data that you’re using.
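The cascading signatures can be sketched as a recursive hash, in the spirit of a Merkle tree. This is a simplified illustration and not how the DDS package is implemented: it hashes compiled bytecode rather than source, treats only functions defined in the same module as "your code" (so `math` counts as part of the environment), and does not handle recursive functions. All function names are illustrative, and redefining `build_features` stands in for editing the code.

```python
import hashlib
import math
import types

def signature(fn):
    """Hash fn's own code plus the signatures of the user functions it references."""
    code = fn.__code__
    h = hashlib.sha256()
    h.update(repr((code.co_code, code.co_consts, code.co_names)).encode())
    for name in code.co_names:
        dep = fn.__globals__.get(name)
        # Follow only plain functions from the same module; library calls are skipped.
        if isinstance(dep, types.FunctionType) and dep.__module__ == fn.__module__:
            h.update(signature(dep).encode())
    return h.hexdigest()

def load_raw():
    return [1.0, 4.0, 9.0]

def build_features(rows):
    return list(map(math.sqrt, rows))

def training_data():
    return build_features(load_raw())

def model():
    return sum(training_data())

steps = (load_raw, training_data, model)
before = {f.__name__: signature(f) for f in steps}

def build_features(rows):   # simulate editing the feature code
    return list(map(math.log, rows))

after = {f.__name__: signature(f) for f in steps}
changed = sorted(name for name in before if before[name] != after[name])
print(changed)   # ['model', 'training_data']
```

No data is touched while computing these signatures: none of the pipeline functions is ever called, which is why the cost is independent of the size of the tables involved.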
You can make terabyte-size tables; it does not matter, because all it is doing is inspecting the computer code that you’re using. So, because of the way that signatures are built for every piece of data that you create, and because the signatures depend on the code that you’re writing, you can very seamlessly collaborate and integrate changes or [inaudible] changes and have different views, just like you would with normal computer code. So let’s say for example that we have a main branch that contains all our code, and it’s the one that is being deployed, the one that people refer to.
Then how does it work to create a new feature with that? Well, it’s quite simple: we create a new branch, a new copy of the code, and then we can start to edit it. Let’s say we’re going to update our features. Because the code is going to be different, we’re going to create some new pieces of data with different signatures, just like we explained before. And then when we run this code in the dev branch, it is going to create some new data and put it inside the cache: our new training data, our new model, our new scores. And in a collaborative environment, this cache is shared.
And this is one of the crucial parts of using data-driven software for collaboration. Everybody who runs something contributes to the common cache. So when Brooke ran a notebook before, and ran a function that created some data, I am also able to see that she ran it and to access all the results that she already created. So this is how, when the development branch has run the code they changed, they created some new data sets and put them inside the common cache. So now, let’s say this code is good: what happens when we merge it into the main branch?
What happens when we rerun the code inside the main branch? We merge the code, and once we reevaluate it, DDS recalculates the signatures. But we have already seen these signatures: they were already evaluated in the dev branch. So we have already calculated everything we needed to calculate in the dev branch, because the code that we now have in the main branch after the merge is the same as what we had in the dev branch. So, all we need to do to update all the data sets that we’re creating in the main branch is a very simple operation.
We do not need to rerun the code; it has already been run in the dev branch. All we need to do is update the outputs that we’re putting in our storage system, that is, update where we put the model, the scores and the training data, without having to rerun anything. So even if it took hours to retrain our model, it was already done once in the dev branch, and there’s no need to redo it in the main branch. All we need to do is simply swap what corresponds to an output and say: this is going to be the new output. So to make it more concrete, Brooke and I are going to show you how we can collaborate on a very simple data pipeline.
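The merge scenario can be modelled with one shared cache keyed by signature and a small index per branch that maps output paths to signatures. This is a toy model under explicit assumptions: the per-branch `outputs` dictionaries, the branch-aware `keep` helper and the two feature functions are all hypothetical, and the talk does not say how DDS lays out its storage.

```python
import hashlib

shared_cache = {}                   # signature -> result, shared by all collaborators
outputs = {"main": {}, "dev": {}}   # per-branch view: path -> published signature
runs = []                           # (branch, path) for every real execution

def signature(fn):
    code = fn.__code__
    return hashlib.sha256(
        repr((code.co_code, code.co_consts, code.co_names)).encode()
    ).hexdigest()

def keep(branch, path, fn):
    """Execute fn only if no collaborator has produced this signature yet."""
    sig = signature(fn)
    if sig not in shared_cache:
        runs.append((branch, path))
        shared_cache[sig] = fn()
    outputs[branch][path] = sig     # publishing is just repointing the path
    return shared_cache[sig]

def features_old():
    return [1, 2, 3]

def features_new():                 # the edited code on the dev branch
    return [1, 2, 3, 4]

keep("main", "/features", features_old)            # main, before the change
keep("dev", "/features", features_new)             # dev runs the new code once
merged = keep("main", "/features", features_new)   # after the merge: no rerun
print(runs, merged)
```

The third call adds nothing to `runs`: the result computed on the dev branch is found by its signature, and the main branch only swaps which cached result its `/features` path points to.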
And I’m going to switch to my Databricks notebook for doing that. I am now again in the Databricks environment. And this time, we are going to use the repository feature to show you how you can quickly switch between branches, and in particular how data-driven software really allows you to do this switch very quickly. So, I’m working here from my main branch. And in this branch, I have an existing data pipeline, which I am going to run. This pipeline is, just as before, very simple. We’re going to load some data from the internet, and this is just a simple CSV piece of data that we’re going to simply [inaudible] and access as a pandas data frame.
So this is a standard data set in machine learning, which is about the quality of wine. And I’m going to trigger it because it takes a few seconds. So in this case, our pipeline will load the data, do a little bit of feature engineering, then split the data set into a training set and a test set, and we’re going to take our training data set and use it to build a model. Also, we’d like to store our model to make it available to other people outside of our team. Also, we’d like to expose the statistics that we’re going to generate for the model, to see how good it is.
So again, we have here a model stats function that returns a little piece of JSON containing the R-squared score and the mean squared error, just using standard scikit-learn code and pandas code. So what does it look like if we want to plot the graph? Well, here, I am going to run it once. I already ran the load data function in the cell before, so this is why it shows up grayed out: DDS already cached it. But DDS has not yet seen how to calculate the JSON statistics for my model, or the model itself. So, I’m going to evaluate the model. Now it is going to take a few seconds to run.
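As a rough illustration of such a statistics function, here is a minimal sketch in plain Python. The demo uses scikit-learn and pandas; this version avoids both so that it stays self-contained, and the function name `model_stats`, the JSON keys and the sample values are assumptions rather than the notebook's actual code.

```python
import json


def model_stats(y_true, y_pred):
    """Return a small JSON document with the R-squared score and the MSE."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    # Residual sum of squares: how far predictions are from the targets.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far the targets are from their mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return json.dumps({"r2": 1 - ss_res / ss_tot, "mse": ss_res / n})


# Made-up targets and predictions, for illustration only.
stats_json = model_stats([3.0, 5.0, 7.0], [3.0, 5.0, 6.0])
parsed = json.loads(stats_json)
```

Because the output is a plain JSON string, it can be stored and read back by people who do not have access to the code that produced it.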
Just to make it a little bit more interesting, I’m going to do a grid search with cross-validation, just to follow standard practices. Okay, it took 10 seconds. It is quick enough for our purpose here, but you can see how it could take longer. And as Brooke showcased before, if I rerun it, the code is the same, the outputs will be the same, so it will directly give me the outcome that I’m looking for. And if I want to, I could directly load my model without access to the code. And in another notebook, I could access the statistics as well. Okay, so this is my pipeline here. Brooke, what do you think about this pipeline?
Brooke Wenig: I think it’s really good, Tim. But I would love to see an RMSE metric there. Could you actually go into my branch, it’s called model scores, and pull the updated one, which has RMSE computed?
Tim Hunter: Sure, I’m going to do that. Close. And because this is DDS, I trust Brooke has already checked her code, and I trust that Brooke has already generated all the data sets that are necessary. So if I click run all, you can see how quickly it is going to run. It immediately has all the results, without my having to run anything on my side. Even if I rerun the whole notebook, everything is already precalculated, there is nothing else to do. Okay, Brooke, I love it. Let’s merge it. And to follow standard good practices, we have all this code backed here by a repo. I’m going to take your branch and I am going to make a pull request. Okay, create pull request.
Checking whether we can merge: what are the changes that Brooke is making here? We can see the content here. Here she is simply adding the root mean squared error, I like that. I am going to merge it. So now, we have updated the code. What happens if we try to update the data? I’m going to go back to my branch, and I’m going to pull all the history. So I’m going to pull all the changes that just happened. So I’m going to close it, and do we see the change here? Yes, I see RMSE here. So, now this notebook has incorporated all the changes that Brooke made in her branch. Now I am back in mine. Now, what happens if I try to run the whole notebook?
How long do you think it is going to take? Well, Brooke ran it in her branch, the code changes that she made got merged into this main branch, and the code now is the same as what she ran, so the results are going to be the same. And this is confirmed here. DDS is telling me that everything is already cached, and there is nothing to do except updating some statistics to account for the fact that this is now in the main branch. And now, I have my results back here. So, with this example, I started with a branch, I started with code that was giving me some results without the RMSE.
Brooke offered to make some changes, and I merged them. And instantly, by running this notebook, it updated all the outcomes so that other people can now access them. And you really see here that Brooke ran her code once, and I ran the code to convince you that it was all working. But when I merged it, and this is one of the critical pieces here, and when I reran it, it was extremely fast. There was nothing else to do, because Brooke had already done all the hard work of running her code and making sure it was working. All I did was simply merge it here. So now we’re going to go back to the slides to wrap it up.
So I hope that this demonstration convinced you how easy it is to merge code and then have all the data seamlessly updated to correspond to the latest changes, without having to run anything. So this really opens up a number of interesting possibilities for how you can use data-driven software for other pieces of your stack. In particular, it can be used as a feature store. A feature store is a fairly novel concept in ML operations, which corresponds to having a central repository of all the pre-calculated features that you may want to use in various machine learning models.
So think of a central repository where you collaborate on the features that you build. That should sound familiar. And indeed, if you use the load command of DDS, then this is pretty much what it offers at no extra effort. In particular, it allows you to load any data associated with a path, in particular features. And it abstracts away how this feature gets implemented, while at the same time giving you the confidence that if you’re loading data through DDS, it will also track all the changes that happen when this data set is updated, when these features are updated.
So, it simplifies all the management of dependencies for you, because any dependent code will be updated when the data set gets updated, and you will see what the changes are and what other pieces of your pipeline are going to be impacted when you’re about to update a feature. A crucial piece, which we did not talk about so far and which we glossed over in the demo, is how DDS is able to seamlessly switch between multiple versions of data. Because in the notebook, I started with my original version of what the data looks like when I use the original version of the code in the main branch.
Then I switched to Brooke’s branch, checked out the code, ran it, saw her results, and then I merged her result back. I did some updates to the code. And it turns out that data-driven software can work in two ways. Either it can do a full checkout of the data, a bit like what you would do with a Git checkout, if you’re used to using Git on your local laptop. In this case, just like Git will create all the files for you so that you can edit them, DDS can populate and materialize all the data sets that you have. So, if you have terabyte-size tables, it can be a bit slow to do, even if it uses the whole power of Spark if necessary.
But at the same time, it is the most compatible way to work with other tools, such as, for example, a tool that does some ingestion and so on. So this is why DDS also has this concept of a lightweight checkout called Links Only, where all it does is update metadata. It is not copying the whole files, because there is already a copy in the cache. If you do not use it elsewhere, there is no need to make another full copy. And in this case, using the load command will be fully compatible with it, and you can access all the data files that you want, just as you would before.
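The difference between the two checkout modes can be sketched as follows. This is an illustration built on local files and does not reflect how DDS stores its metadata: the functions `checkout` and `load`, the `.link` file convention and the signature `abc123` are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def checkout(cache_dir, out_dir, links, links_only):
    """`links` maps an output name to the signature of its blob in the cache."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, sig in links.items():
        if links_only:
            # Metadata update only: record which cached blob the name points to.
            (out_dir / (name + ".link")).write_text(sig)
        else:
            # Full checkout: materialize a real copy that other tools can read.
            shutil.copyfile(cache_dir / sig, out_dir / name)


def load(cache_dir, out_dir, name):
    """Read an output, following the link into the cache when there is one."""
    link = out_dir / (name + ".link")
    if link.exists():
        return (cache_dir / link.read_text()).read_bytes()
    return (out_dir / name).read_bytes()


root = Path(tempfile.mkdtemp())
cache = root / "cache"
cache.mkdir()
(cache / "abc123").write_bytes(b"model-bytes")  # a blob already in the cache

checkout(cache, root / "full", {"model": "abc123"}, links_only=False)
checkout(cache, root / "links", {"model": "abc123"}, links_only=True)
```

In both cases `load` returns the same bytes, but only the full checkout leaves a standalone `model` file that a tool unaware of the cache could open directly.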
And it is extremely fast, as you saw here: running a notebook and updating all the references in this notebook, or checking them against just links, is a matter of tenths of a second. On top of that, to make sure that users have a really smooth and seamless experience when using DDS, and because DDS has such strong guarantees and expectations about how to run code, you can do very aggressive caching. In particular, one of the critical pieces is that when you run a piece of code, DDS needs to understand if any change happened, so it needs to calculate the signatures.
It turns out that DDS also, under the hood, reuses a lot of the work it has done if you run the code multiple times, which happens very often inside a notebook in an interactive session. And this is why rerunning code is pretty much as fast as importing a module or, when the cache is warm, as fast as simply accessing the data directly. It can even load it in memory, so that it does not need to run it or even fetch it from a distributed store. So it is just as fast as calling a regular Python function, without having to execute it. You’re probably wondering, what still needs to be done in order to make it work?
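The in-memory layer mentioned here can be pictured with a small sketch. It is not the DDS implementation: the class `LayeredCache` is hypothetical, and a plain dictionary stands in for the distributed store.

```python
class LayeredCache:
    """Keep process-local copies of results fetched from a slower store."""

    def __init__(self, store):
        self.store = store       # stand-in for a slow distributed store
        self.memory = {}         # results already fetched in this session
        self.store_reads = 0     # how many times the slow store was hit

    def get(self, sig):
        if sig not in self.memory:
            # Cold cache: fetch once from the store and remember the result.
            self.store_reads += 1
            self.memory[sig] = self.store[sig]
        # Warm cache: served from memory without touching the store.
        return self.memory[sig]


layered = LayeredCache({"sig1": [1, 2, 3]})
first = layered.get("sig1")
second = layered.get("sig1")
```

The second lookup is served from memory, which is the situation described above when a cell is rerun in the same interactive session.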
And what is the status? Can you use it? It is publicly released, and it is available on the PyPI Python repository. You can also find it on GitHub. And if you want to use it, this is just a matter of running this command, [inaudible] install dds_py, which Brooke showed you in her example. And also, we’re including some documentation and reference notebooks, if you are using the Databricks environment, so that you can see how to most effectively configure it, like I did in this demo. What are the stability guarantees? How stable is it? Can you use it in production?
So, you can consider it right now as being in a beta version, in the sense of a stable beta version. And what I mean by that is that the API is not expected to change, but some corner cases might still be adjusted, which may trigger some rerunning of calculations, because the way the signatures are calculated might be a little bit different. And the reason for still leaving this room here before calling it a stable version is that Python is a very broad and complex language when you look into all its capabilities and all the details. And this is why there are probably still some corners that need to be fully explored and ironed out before calling it a fully stable version.
As for what still remains to be done for a stable 1.0 release, one feature which is not available yet is static functions, which is one way to write some functions inside classes in Python. Some Python features are probably not going to be supported, and from experience they are usually not very useful in data science code if you use the whole power of DDS. In particular, if you’re using asynchronous functions, continuations and a little reflection to [inaudible] class methods, it is simply very hard to detect, and it is also not very useful in practice for most regular data science code.
DDS will simply see this code and note that it is there, but it will not necessarily track the data set dependencies inside it. So, to conclude here, in the more general setting, when we think about AI and data science in general, we can really think of it as trying to combine data with code, because code generates data, and data depends on code to be generated, but code after that also reads data in order to do its operations.
So data-driven software is one approach to smoothly and seamlessly combine these two worlds. And it works by breaking down the problem into the raw data, the data that you have been given as an input and that you will not change, and then all the code transforms that you need to apply on top of it to generate, refine, clean up or model all the pieces that you want. So, the philosophy behind it is really that you turn every data problem that you have, when you think about cleaning, modeling, computing statistics, checking and so on, into a coding problem.
This is something that we as data scientists have been trying to do ever since our studies. So this is why I encourage you to try it out. Give DDS a try by installing it using dds_py, and we look forward to seeing what feedback you will have. Thank you for watching this presentation. I would like to invite you to rate and review this session, as well as all the other sessions that you’ll be watching. This feedback is really important for us and for the organizers to be able to improve the experience that you’re having with this virtual summit.
Thank you for attending this session. Thank you for attending the Data and AI Summit 2021, and we look forward to seeing you in the real world, face-to-face, at some point in the future.
Brooke Wenig is a Machine Learning Practice Lead at Databricks. She leads a team of data scientists who develop large-scale machine learning pipelines for customers, as well as teach courses on distri...
Tim Hunter is a senior AI specialist at the ABN AMRO Bank. He was an early software engineer at Databricks and has contributed to the Apache Spark MLlib project, and he has co-created the Koalas, G...