– Hi, my name is Petra Kaferle Devisschere. I’m a data scientist at Adaltas, and today I would like to discuss data versioning, how it can increase the reproducibility of machine learning projects, and how DVC and MLflow can help us with that. But let me first introduce myself and the company. I joined Adaltas recently as a data scientist; before that, I had 13 years of experience in data analytics in life sciences, where I was developing new methods to analyze data and automating the most frequently used protocols. Adaltas has been involved in big data since 2010. We are based in France and also present in Morocco. We have partnerships with Cloudera, Databricks and D2IQ. We contribute to the open source community and we also teach big data at six different universities in Paris. So today, we will look at versioning in data science in general, then more specifically at data versioning; we will look at what DVC and MLflow are and what they do, and at the end we will do a demo to see how we can use them. So first, let’s see what we need to track in a data science project to be able to reproduce the results. First is the dataset. The data can be coming in continuously, we can do some feature engineering, we can select different columns for different experiments, and all of that we would like to track. Then we have the code. The code can be about data pre-processing or about modeling. During the modeling phase, we will be trying a lot of different parameters or hyperparameters, and all of these will be producing different results. And the difficulty is not to track each component individually, but to track the dependency: which dataset and which code produce which result. However, data is not only part of data science, right? Data is everywhere, and data is changing dynamically, all the time; we acquire more and more of it. That’s why data versioning is important in other domains as well.
Here we have some additional examples of where that might be, but the list is definitely not exhaustive. For example, in data engineering, during data cleaning and dataset preparation; then in software engineering, where we have the test data with which we test the functionality of the software at different levels of granularity, for example in unit testing, integration testing and functional testing. The same goes for database applications, where we will initialize a database and populate it with some test data to see if the data model corresponds to what we want. Then there is a whole community trying to establish good practices for ontology versioning for the Semantic Web, and we could go on and on. However, today we will focus on data in machine learning and in data science. So why exactly is it important to know the exact version of the dataset that we are using? First of all, for data quality management and assurance, so that at any point, even after the model has been built and put in production, we can tell that the data was of good quality, whether it was complete, or whether it was really the best dataset available at the time. This goes hand in hand with audits, especially in more sensitive domains like healthcare, where it is very important to explain why we took certain decisions. Then, more generally, we would like to have reproducible training: when we train the same model with the same parameters, we would like to get the same results, otherwise we cannot optimize it. Then, to track the quality of models that are already in production, to know if we need to retrain them. Of course, for automation of testing and deployment, we need to know exactly which version of the data we tested. And in the end, when you have a product, which can be an application or a web service, the product can only be as good as the data behind it. So there are different solutions.
But mostly they were adopted from software engineering, where tools for versioning code already existed, probably the most famous being Git. But it turns out that Git is not suitable for versioning data. Why? Because datasets are very big. Git was designed to work with smaller files, but datasets can get large very fast, and even the Git extension LFS (Large File Storage) has file size limitations, which can be exceeded quite quickly with the data sizes we have today. An alternative could be to set up external storage where we calculate the checksums ourselves every time we modify the data, but that kind of method requires a high level of consistency and can easily deteriorate with time. Therefore, we would like a solution which is easy to implement, and today we would like to suggest a combination of DVC and MLflow. DVC will track and version the datasets, and MLflow will record the information about the exact dataset used in an experiment. DVC solves Git’s limitations, and it is easy to learn since it uses Git’s vocabulary. MLflow is also easy to use, and it lets you track all the objects that you can access through your code. On top of that, it has a very nice user interface, and in the end, everything is only a click away. So let’s look at them individually: what they do and how they function. First, DVC. DVC tracks datasets in machine learning projects, and it also supports building and running pipelines. It allows us to choose where we want to store the data, which can be locally, in the cloud, in HDFS and in many other locations. And it runs on top of a Git repository. The way it functions is that DVC versions a dataset and creates a small file which will be versioned by Git. So DVC tracks the dataset itself, and Git tracks the information about the dataset. Now, MLflow. MLflow is a platform to manage the machine learning lifecycle, and it has many modules.
It helps you track your experiments, you can package your models, and you can also serve the models, but we will only focus on experiment tracking today. So now, how can we use them together to achieve the reproducibility of our project? First of all, let me say that you need Git installed and a virtual environment with DVC and MLflow. We will be using data coming from the MLflow Git repository, and the code also comes from there. First, we will initialize a local repository with Git and DVC. Then DVC will track the dataset that we copy inside, and Git will track the information about this dataset that is produced by DVC. Then we will push the dataset to a remote storage. And when we want to access the exact version of our data from our code, we will use the DVC API, and we will track the details about the dataset along with the metrics of our model with MLflow. So now let’s continue with the demo to see how it works. We are in a directory called demo, which we will initialize as a Git repository and as a DVC project. DVC will track the dataset itself, but Git will track only the information about this data and about its changes. DVC will add the name of the file containing the data to the .gitignore file, and therefore the data itself will never get pushed to the Git repository. We will see how this works during the upcoming example. But now let’s see what happened during initialization. A lot of files were generated, but I would like to point out the DVC config file, in which we will store the location of our remote storage, which we will configure just after committing those changes. So let’s configure the remote storage now. We will place it in a temp folder, and with this we would like to illustrate that DVC really allows you to choose where you want to store your data. You are not obliged to push it to some remote server or to put it in the cloud; you can also store it on premise, even on your local computer.
And this is very important if you work with sensitive data. Now we can check the content of the DVC config file, and we see that the location of our remote storage is really here. So we can commit this change to Git, and now we are ready to start tracking our data. Let’s imagine that you have a dataset which you already used before, or maybe you just got it and you would like to start tracking and versioning it from the beginning. The first thing you need to do is to copy it into this repository. So let’s first create a new directory called data, and now we will copy the existing data that we want to track into this data folder. The dataset is stored in my ML project, and it is named wine-quality; I took it from the MLflow repository. Now it is copied to the data folder, so let’s see what’s inside. It’s this wine-quality CSV file of size 264 kilobytes. Now, if we want to start tracking it, we just add it to DVC: dvc add and the name of the file. And now we can look again at what is in our data folder. Here we see that a new file appeared, with the extension .dvc, and if we look at what’s inside, we see that it contains the information about our dataset: an md5 hash and the name of the file. Now, if we look at the content of the .gitignore file, we see that it really contains the name of our data file, so the data will not get pushed to the Git repository. This is the way the Git layer and the DVC layer are separated. Okay, now we need to add both files to Git, and we need to commit. Another thing that we will do is create a tag for each version of the dataset, which will make it easier to access the exact version later on in the process. But the data still only resides in our data folder; it still didn’t go to the remote storage. To copy it there, we need to do dvc push. Now we can look at what we have in the remote storage.
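Inside the initialized repository, this first tracking cycle looks roughly like this (the source path of the dataset is a placeholder; the file and tag names follow the demo):

```shell
# copy the dataset into the repository and put it under DVC control
mkdir -p data
cp /path/to/mlflow/examples/wine-quality.csv data/
dvc add data/wine-quality.csv

# Git tracks only the small .dvc metafile and the .gitignore entry,
# never the data file itself
git add data/wine-quality.csv.dvc data/.gitignore
git commit -m "Add wine-quality dataset"
git tag -a v1 -m "wine-quality dataset, version 1"

# upload the data itself to the configured remote storage
dvc push
```

The tag is what makes a version easy to refer to later, both on the command line and through the DVC API.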
We see that we have a copy of our dataset, with the same size as before, 264 kilobytes, but the name changed: together with the name of the folder, it is the hash of the data. Okay. Now, since the data is stored in the remote storage, we can remove it from our data folder because we don’t need it there anymore. But here you need to be careful to really only delete the dataset and not the .dvc file, because then you would lose the link with your data and you would need to clone it back. There is another location where a copy of this data appears, and this is the DVC cache, so we can remove it from there as well. Okay. Now, if we decide that we want to work with this data again, we bring it back by running dvc pull. If we look again in our data folder, we see that the data is back with its original name. Okay, now let’s modify this data, let’s create a new version. We will do it with a very simple modification, the removal of a thousand lines. We can check that the modification really took place: the size of our file is 211 kilobytes now, which is different from the 264 before, so we modified it successfully. Now, if we want to add it to the remote storage, we need to repeat the same procedure as before: dvc add and the name of our data file, git add and the name of the .dvc file, git commit all the changes, and create a tag, which for us now is version two. Then we still need to copy the data to the remote storage, so we run dvc push. What is left is to delete the data, since we don’t need it locally anymore, and the cache, and we are done. So what we can do now is look at the git log, and we see that with the git commit messages and the names of the versions, it is very clear how the data was modified, when, and by whom. Tracking the changes is very easy, and it is clear already from the git log.
But now, imagine that you want to access an exact version of your data through code, to use it in your machine learning project. To illustrate that, I used a script, also from the MLflow repository, which I modified; now we will look at these modifications. So let’s look at the code, and I will explain what I added to achieve the functionality of tracking the dataset through DVC and MLflow. First, I needed to import the DVC API, which has a function get_url, in which I defined a path, a repository and a revision. The revision can be a Git commit ID, which is not very user friendly, it can be a tag, it can be many things, but I found the tag to be a very meaningful description of your dataset. You can put in whatever you want, and for the data scientists working with the datasets, it can already mean something about the data they are using. Okay, so we extract the URL of the exact version of the data, and we pass this URL to the function that will open this data. But this function is the function you are already using; in this case it’s pd.read_csv, so you don’t have any additional step here. Basically, you just need to point at the location of your dataset, and the code stays the same from this point on. Now we will track some details about our data with MLflow. We will save them as parameters, meaning key-value pairs: the data URL, the version, how many rows were in the input dataset and how many columns. And then we will also add artifacts, meaning we will save files: the names of all the columns that were used to train the model, which we will store as features, and the names of the columns that we were predicting; in this case, it was only one, but it will be stored in targets. So now let’s run the script. Then we can go back and change the version: for the first run we were using version one, now let’s try with version two, and let’s rerun the script.
You can see immediately that the values of the metrics are not the same, which means that different datasets were taken into account. But now we can go to the MLflow UI, where we will look at things in more detail. Already from the dashboard, you can see that there were differences in the datasets and what those differences were: the numbers of rows were not the same, and the versions were different. If you click on a run, you can get the exact URL of your data version and all the other details that you stored. And here at the end, we have access to the names of all the columns that were used during the training and the name of the column that we were predicting. If you want to share this with somebody, you can just download it. But even if you don’t, you can keep it as part of your project, because it will increase the readability and reproducibility of your work down the line. Even though you can extract this information from the code, after we stop working on a project and start forgetting what was going on, it takes a lot of time to remember and dig out all these details, while tracking them as you go is much easier and will help you in the long run. As I said before, MLflow and DVC are very rich tools with many functionalities, and since we are limited in time during this demo, I encourage you to go and look at them yourselves, and maybe find out what might work for you. Okay, so now let’s take a step back and look again at what we just did, to make things clear. We initialized a local repository with Git and DVC. We copied our dataset inside; the dataset was versioned with DVC, which also produced a .dvc file, which was versioned with Git. We pushed the data to a remote storage.
And when we wanted to access the exact version of a dataset through the code, we used the DVC API, and then MLflow to track the details about the dataset, along with our machine learning metrics. With this I would like to conclude the talk. Thanks a lot for your attention. If you have any questions, please let me know, and have a nice day.