Machine learning suffers from a reproducibility crisis. Deterministic machine learning is incredibly important for academia to verify papers, but also for developers to debug, audit and regress models.
Due to the various reasons for non-deterministic ML, especially when GPUs are in play, I conducted several experiments and identified all causes and the corresponding solutions (if available).
Based on these solutions I developed mlf-core (https://mlf-core.com), which provides very sophisticated CPU and GPU deterministic project templates based on MLflow for PyTorch, TensorFlow and XGBoost. A custom linter ensures that models are deterministic at any point.
Speaker: Lukas Heumos
– Hi everyone, and welcome to my talk, Deterministic Machine Learning with MLflow and mlf-core. Very briefly about me: I’m Lukas Heumos and I’m based in Tübingen, where I also received my Master’s in Bioinformatics from the University of Tübingen, and I am a research software engineer at the Quantitative Biology Center in Tübingen. One of the things that I particularly focus on is reproducible research.
Very briefly about the Quantitative Biology Center, or QBiC for short: we are a bioinformatics core facility at the University of Tübingen. We primarily conduct data management and data analysis for our customers, usually in the biology area. And we are a strong contributor to reproducible research, where we develop tools and workflows for reproducible research. We also have a job opening for a scientific data steward that you can access via the shown QR code, but you can always reach us via your preferred means.
I’m sure that most of you have already come across one of these headlines, for example “Machine Learning is Creating a Crisis in Science” or “Artificial intelligence faces reproducibility crisis”. But why do we actually care about these issues? From a scientist’s perspective, we base our work on hypotheses, and we have to verify them, because if we don’t, the next hypothesis that we put up might be based on completely wrong results, and research is supposed to build on earlier results. A few researchers evaluated reproducibility in machine learning: they took a look at 400 papers and found that only 24% of them were actually reproducible, quite a low number. Furthermore, we need to be able to audit machine learning models. Machine learning is now used in all kinds of sensitive application areas, such as terrorism detection or health, and the public needs to be able to audit those models to verify that they do what they’re supposed to do.
Now, from a developer’s perspective, we conduct a lot of experiments with our models and our code, and our models need to be deterministic, because if we don’t know whether what we currently observe is due to our experimental setting or due to random factors, then we can’t experiment at all. Furthermore, when we are debugging our models, we need to pin down the origin of an error tightly, but we can’t do so if we don’t know whether a difference in model performance is due to randomness or due to the bug that we’re actually trying to track. And finally, if we want to ensure that our model still performs as it’s supposed to, for example when we add additional data, we need to be able to perform regression tests. We can only really tightly verify that our model is doing what it’s supposed to do when it’s deterministic.
But what are the primary reasons for non-reproducible machine learning? First and foremost, the data and the code are not shared. This was also found in the paper that I’ve just mentioned. Secondly, the documentation is oftentimes insufficient: the hyperparameters, the metrics and all kinds of numbers are not reported. Furthermore, the hardware used is very often not reported at all. Machine learning is also oftentimes conducted in an irreproducible environment: people use different versions of the libraries, and that just leads to conflicts and completely irreproducible results. Finally, especially today, the usage of GPUs is really widespread due to the performance gain that they provide us with, but GPUs usually use non-deterministic operations and non-deterministic algorithms.
So what could that actually look like? What you can see here on the slide on the left is a depiction of the so-called sum-reduce algorithm, which is based on CUDA, the underlying API that NVIDIA provides for GPUs.
Now, if you focus on the left and start at the very top, you’re always just adding numbers in pairs. For example, you add the 57 and the 19 first and you get 76, then you add the 52 and the 14 and you get 66, and you just continue this. If you go to the next row, all you do is again add up two numbers. So what we’re trying to do here is this: we have a lot of numbers and we want to add them all up to get the full sum. That is the sum-reduce algorithm. The issue here is that the GPUs that we are using to calculate those sums operate in parallel, and actually summing all of that up requires synchronization.
When we make the problem more difficult and turn our eyes to floating point numbers, things start to get weird. Here you have two experimental runs: we are calculating the sum twice using the same numbers. If you look on the left, the first thing that we do is add 0.1 and 0.2, and we end up with 0.3 something. Next we add 0.3 and we obtain some number; here it’s 0.6, then a lot of zeros and a nine. So it’s not perfectly precise; there’s some imprecision. Now, if we switch up the order and calculate the sum of 0.2 and 0.3 first, this time we end up with 0.49, many repetitions of the nine and then an eight. And when we then add 0.1, suddenly we get a different number: 0.59 something.
Why is that? That’s because floating point summation is not associative. If we’re doing this with GPUs, the primary issue is that the order of the thread synchronization is always different, and this leads to different floating point errors. You can imagine that if you apply that algorithm many thousands, millions, even billions of times, those very small errors that you can observe actually add up. And most machine learning libraries are based on these atomic operations.
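The pairwise summation and the order dependence described above can be reproduced on a CPU in a few lines of Python. This is a minimal sketch, not CUDA code: `pairwise_sum` is a hypothetical helper that mimics the row-by-row pairing on the slide, and the floating point part uses Python's 64-bit floats, so the exact digits need not match the ones shown in the talk.

```python
def pairwise_sum(values):
    """Tree-style sum-reduce: add neighbours in pairs, row by row."""
    row = list(values)
    while len(row) > 1:
        paired = [row[i] + row[i + 1] for i in range(0, len(row) - 1, 2)]
        if len(row) % 2:  # an odd element is carried over to the next row
            paired.append(row[-1])
        row = paired
    return row[0] if row else 0


# Integers: the pairing order does not matter (57 + 19 = 76, 52 + 14 = 66).
print(pairwise_sum([57, 19, 52, 14]))  # 142

# Floats: the same three numbers, grouped in two different orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False: floating point addition is not associative
```

On a GPU the grouping is decided by whichever threads finish first, so the effective order of additions can change from run to run even though the inputs do not.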
Even though we’ve just talked about this one issue, there are many more reasons for non-deterministic behavior, but this is the most prominent one that people are usually not aware of. There have been some recent developments, though. For the major machine learning frameworks such as PyTorch and TensorFlow, but also XGBoost, there are now deterministic variants of the algorithms on offer, and they are implemented without the atomic operations that I’ve just mentioned. But there are a couple of questions that we can ask. Are those deterministic algorithms actually working as expected? Are there deterministic options for all algorithms, or only for a subset of them? And what is the effect on the runtime?
To evaluate all these questions, I used the following setting. I have containerized projects running in Docker and Singularity containers for PyTorch, TensorFlow and XGBoost. In the PyTorch and TensorFlow settings, the models try to classify handwritten digits from zero to nine, which you can see at the top right, in the left image. In the XGBoost case I was working with the Covertype dataset and tried to classify tree cover types into different classes. So this was the aim of those settings.
I used three different hardware setups to tackle the hypothesis whether the hardware has an effect on the determinism. System one, which was just a personal laptop, had an i5 and an NVIDIA 1050M. The second system had a 12-core Intel CPU and two NVIDIA Tesla K80s. The third and the fourth system were a little more recent: they had 24 cores and two NVIDIA Tesla V100s. When I was evaluating the determinism of all of these projects, I had three settings that I wanted to evaluate: the CPU, a single GPU, and multiple GPUs.
For each one of those three, I was running the experiments first with no random seeds set at all, which is basically our baseline, then with all possible random seeds set, and then with all random seeds set and the deterministic algorithms that I just mentioned enabled. I ran each of those settings five times. In the following we will only look at the PyTorch results in the interest of time, but the TensorFlow and XGBoost results tell a similar story.
What you can see here is the box plot of the PyTorch runs on system one, so basically my laptop. We’re plotting the setting against the loss on the y-axis. Let’s start from the very left. If we run on the CPU of system one with no seeds set at all, you can see some variance. It’s not a flat line but a box plot that actually has a body, so there’s some variation, which we don’t want to have. If you focus on the second setting, we enable the random seeds and we get a deterministic result. If we enable the further deterministic settings, there’s no additional effect: we still get deterministic results, which is what we want.
Now, if you start using the GPU, which is the fourth plot, and you don’t set any seeds at all, you have a non-deterministic result, and now things get more tricky. This demonstrates the issue that I mentioned earlier: if you now fix all of the random seeds, basically every seed that you can find, you still get some variation and don’t have a deterministic result. But finally, if we enable the deterministic algorithms, then yes, we do get a deterministic result, which is what we wanted to obtain.
But let’s switch hardware for a moment. This time we’re doing the same thing with system two, which has not a single GPU but two GPUs, which allows us to test the same hypothesis for multiple GPUs.
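The protocol behind these box plots (run each setting five times and see whether the loss varies) can be sketched with a toy stand-in for training. Everything here is an assumption for illustration: `train_once` is not a real model, it only draws random "weights" and returns a fake loss, and `is_deterministic` is a hypothetical helper that treats a setting as deterministic only if all runs return bit-identical values.

```python
import random


def train_once(seed=None):
    """Toy stand-in for a training run; returns a fake 'loss'.

    With seed=None the generator is seeded from OS entropy, which plays
    the role of the 'no seeds set' setting in the talk.
    """
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(100)]
    return sum(w * w for w in weights) / len(weights)


def is_deterministic(run, n_runs=5):
    """Run `run` n_runs times; deterministic means all results are identical."""
    losses = [run() for _ in range(n_runs)]
    return len(set(losses)) == 1, losses


unseeded_ok, unseeded_losses = is_deterministic(train_once)
seeded_ok, seeded_losses = is_deterministic(lambda: train_once(seed=0))

print("no seeds set:", unseeded_ok)  # a box plot with a body
print("seed fixed:  ", seeded_ok)    # a flat line
```

The comparison is deliberately exact rather than tolerance-based: a flat line in the box plot means the five losses are the same number, not merely close.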
The first three box plots basically tell the same story that I mentioned earlier for the CPU: if we set all of the random seeds and enable the deterministic algorithms, things are deterministic. But things get interesting when we use GPUs. If you focus on the fifth plot, basically in the middle: if you run your model with a single GPU and with only the random seeds set, we suddenly have deterministic results. This is not what we expected, because for the plot that I’ve just shown you, I told you that if you only set the random seeds for GPUs, you don’t necessarily get a deterministic result. So why could that be? One possible reason is that the underlying neural network implementation that NVIDIA provides, which is called cuDNN, has a benchmark setting. This benchmark setting tries to find the fastest algorithm for a given hardware. I expect that in this case it selected algorithms which happen to operate deterministically, and therefore we obtain this result, but you cannot always bet on that. Therefore you should still enable the deterministic settings, which is the sixth plot here. Then you always get deterministic results, but they differ from the ones you get if you disable the deterministic settings and only run the model with all random seeds set. The same story can be observed in this case with multiple GPUs, which are the three plots on the right.
One thing that you should know is that as long as you keep the hardware consistent, so if you use the same hardware architecture, you should always get the absolute same results, given that all deterministic settings are enabled. In this plot you can see system three and system four, which are different systems but have the same hardware, with all the deterministic settings enabled.
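In PyTorch, the three ingredients discussed so far (random seeds, the cuDNN benchmark setting, and deterministic algorithms) map to a handful of calls. The following is a sketch under assumptions, not the mlf-core template code: `make_deterministic` is a hypothetical helper, the imports are guarded so the sketch also runs where NumPy or PyTorch are not installed, and `torch.use_deterministic_algorithms` only exists in newer PyTorch releases, hence the `hasattr` check.

```python
import os
import random


def make_deterministic(seed=0):
    """Seed the available RNGs and force deterministic algorithms.

    Returns the names of the libraries that were configured.
    """
    configured = ["python"]
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
        configured.append("numpy")
    except ImportError:
        pass

    try:
        import torch
    except ImportError:
        return configured

    torch.manual_seed(seed)           # CPU generator
    torch.cuda.manual_seed_all(seed)  # all GPUs; harmless without CUDA

    # Do not let cuDNN pick "the fastest" algorithm per hardware;
    # ask for its deterministic implementations instead.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

    # Needed by cuBLAS for deterministic behaviour on CUDA >= 10.2;
    # must be set before the first CUDA operation runs.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

    # Raise an error when an operation without a deterministic
    # implementation is used (newer PyTorch releases only).
    if hasattr(torch, "use_deterministic_algorithms"):
        torch.use_deterministic_algorithms(True)

    configured.append("torch")
    return configured


print(make_deterministic(seed=42))
```

Setting the cuDNN flags explicitly matters because of the effect seen on system two: with benchmarking left on, a run can look deterministic on one machine simply because the selected algorithms happened to be deterministic there.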
The first two box plots are for the CPU: same hardware, deterministic settings, same results. It’s the same story for a single GPU and the same story for multiple GPUs. Maybe you noticed earlier, when we compared system one and system two, that even with all of the deterministic settings enabled, the results are actually not quite the same. They differ a little bit. So what can we take from that?
Next, let’s evaluate the runtime. Should we always enable those settings, or not at all? You can imagine that there’s a reason why NVIDIA implemented those atomic operations: they are basically the fastest way to calculate a sum. So what is the effect on the runtime? If you just compare the GPU runs, which is what is shown in this plot, you can see the three box plots on the very left, which are for the single GPU. The third one, the green one, is the one that we care about the most, because it is the one that has the deterministic settings enabled. In this case, for only five runs, keep that in mind, the runtime was even lower than when we only had our random seeds set. So really there was no effect on the runtime at all. It is the same story for the next three plots. If we take a look at the multi-GPU setting, the runtime was a little higher than when only the seeds were set, but it was lower than when no seeds were set at all. So in the case of PyTorch, there’s really no reason not to turn the settings on, because there is no real observable effect on the runtime.
What are the primary takeaways from the experiments that I conducted? First of all, we’ve discussed PyTorch, but as I’ve mentioned, similar results were obtained for TensorFlow and also for XGBoost. Those deterministic algorithms do work, but they are, surprisingly, barely tested.
During my evaluation I found several bugs, reported them and got them fixed together with the library developers, but I’m sure that there are many more issues hidden somewhere. Those deterministic settings need to be forced, because sometimes the underlying implementation, for example cuDNN, chooses algorithms that happen to appear deterministic but might not always be. Furthermore, not every algorithm actually has a deterministic option, so some of the algorithms should not be used at all if determinism is of concern. It’s also very difficult to get complete lists. PyTorch, for example, mentions that they expect to have about 150 functions that are not deterministic, but they don’t provide you with a full list. Therefore it’s even harder to keep track of those functions in your code, because as usual you’re writing a lot of code, and you can’t keep track of which functions you’re allowed to use and which ones you’re not.
Furthermore, determinism is actually hardware-architecture dependent. If you train your model on two completely different hardware setups, don’t expect to get the same results, even if you have all determinism settings enabled. Also, as we’ve just seen, there is really a negligible effect on the runtime. Duncan Riach from NVIDIA conducted even more experiments on a larger model, and according to him, for TensorFlow that’s about a 6% performance decrease if you enable all of the deterministic algorithms. I personally consider this to be fine, because you’re saving a lot of time with determinism and you can really speed up the experimentation, so it pays off, I think.
In summary, what are the requirements for deterministic machine learning? Based on the experiments that I’ve conducted, we should keep track of all of the hyperparameters and the metrics that we obtain, and document our model in depth.
Furthermore, we should run our model in a reproducible container so that we can ensure that we always use the absolute same runtime library versions. Thirdly, we should always enable all of those determinism settings for the machine learning libraries. And finally, we should keep track of the hardware that we use, because if we switch to different hardware, we also won’t get a reproducible result. If we adhere to all of those requirements, we should get reproducible runs and always the absolute same result. But these are really complex requirements. There’s a lot that you as a developer have to keep track of. So what we actually need here is an intuitive software solution.
The solution that I will now present is based on MLflow by Databricks. We are making use of three primary components. First of all, the tracking component of MLflow allows us to record and query our experiments; it allows us to keep track of the hyperparameters that we’re using, the metrics that we obtain and these kinds of things. Secondly, we make use of MLflow Projects, which allows us to package our code and our project in a reproducible format that we can then run on any platform. And thirdly, MLflow Models allows us to deploy our model in an easy manner and keep track of all of the models that we’ve trained.
Based on MLflow, I would like to introduce to you mlf-core, which is inspired by nf-core and is really designed to enable deterministic machine learning. Just a small overview: what does mlf-core actually give you? mlf-core provides a super easy way to create projects based on project templates. These project templates are already deterministic from the get-go, so you start with a CPU and GPU deterministic project template. Secondly, those templates come with a super rich continuous integration pipeline. For example, a Docker container is built right from the beginning for the project.
There are various code linters, so tools that statically analyze the code, which are running from the get-go. Your project trains on a small training subset to verify that it’s always working, these kinds of things. Thirdly, mlf-core implements a custom linter. We will discuss this linter again a little later, but it statically analyzes the code to verify that it adheres to all of the mlf-core standards. And of course, as you can imagine, one of the mlf-core standards is that your model is always deterministic at any time.
Furthermore, mlf-core has a so-called sync feature. We at mlf-core will always improve the mlf-core templates. If you created a project, say, two years ago, and we have released a new version of mlf-core with an updated template, for example for the PyTorch template that you used two years ago, then you will automatically get a pull request with only the changes that we’ve made to that template. This ensures that you’re always up to date, so you always use the most recent mlf-core standards to standardize your project and to ensure that it will always be deterministic.
Furthermore, mlf-core is well documented. You will find a lot of help and documentation to get started, but also to use some of the more advanced features. And finally, we are trying to build a community around mlf-core, because deterministic machine learning and reproducible science are really important to us. We want to ensure that even in the future, when things change, we will be able to provide deterministic templates for you.
What are the available project templates? At the moment we have four project templates in mlf-core. The first one is a PyTorch template. The second one is a TensorFlow template. The third one is a basic XGBoost template.
And the fourth one combines XGBoost with Dask, which is kind of comparable to Apache Spark but more used in the Python community. This allows you to train an XGBoost model even on multiple GPUs, which is not supported out of the box for XGBoost.
Let’s take a look at a small example that demonstrates the process of creating an mlf-core project. Let’s start from the beginning. There we go. First you will be asked for the primary framework that you want to use. In this case, we select PyTorch. We enter a project name and a short project description. Then we select a version and a license, and whether or not we want to create a GitHub repository automatically. If we do so, it will automatically create one for us, even for organizations or as a private repository, and it will automatically upload our project code from the template to that repository. This huge colorful blob that you can see here is the mlf-core linting running once, which verifies that the project that you’ve just created actually adheres to all of the standards. We will take a look at that later in more detail, but just to give you an impression: it’s super easy and intuitive, you just have to enter a few things and select things with the arrow keys, and you end up with a complete project.
What you get is this setup here: a lot of files that we really can’t discuss in detail. It comes with a full GitHub Actions setup, an MLflow setup, a Conda setup, a Docker setup and Read the Docs, so a lot to get started with. Now, if you train one of those models with MLflow, you can see here that it gives you the start time, the parameters and the models, together with all of the metrics. So this is all logged for you.
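What such a logged run needs to contain, according to the requirements named in the talk (hyperparameters, metrics, library versions, hardware), can be pictured with a small plain-Python stand-in. This is explicitly not the MLflow API and not the mlf-core report: `gpu_names` and `run_record` are hypothetical helpers, and the GPU lookup simply assumes that the `nvidia-smi` command-line tool may or may not be present.

```python
import json
import platform
import shutil
import subprocess
import sys


def gpu_names():
    """Return the names of visible NVIDIA GPUs, or [] if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        )
    except (subprocess.SubprocessError, OSError):
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def run_record(params, metrics):
    """Bundle everything needed to interpret (and later repeat) one run."""
    return {
        "params": params,                  # hyperparameters, including seeds
        "metrics": metrics,                # e.g. final loss
        "python": sys.version.split()[0],  # runtime version
        "platform": platform.platform(),   # OS and kernel
        "processor": platform.processor(), # CPU description (may be empty)
        "gpus": gpu_names(),               # hardware the results depend on
    }


record = run_record({"seed": 0, "epochs": 5}, {"loss": 0.123})
print(json.dumps(record, indent=2))
```

Recording the hardware next to the metrics is what makes a later mismatch explainable: as the experiments showed, identical settings on a different architecture can legitimately produce different numbers.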
Secondly, if you train one of those models, you get a few additional reports besides the model that you usually keep track of. You get a system intelligence report that shows all of the GPUs that you’ve used. So you have the parameter tracking and the GPU tracking here, a complete system intelligence report, to solve the tracking problem that we talked about earlier.
Now, if you run mlf-core lint manually, it runs a whole bunch of tests, for example whether the versions that your project uses are still up to date or whether you need to update them, and whether there are TODO strings left. And at the very bottom, I want you to focus on that, you can see some test failures in this case. It will tell you if you use any of the non-deterministic algorithms or if you deleted any of the required seeds. You as a user don’t need to be aware of all these settings and issues, because mlf-core lint will tell you what you need to fix in order to make a project deterministic. This solves the requirement for the deterministic settings.
To bring it all together: MLflow together with Read the Docs solves the documentation issue. Conda together with Docker solves the container issue. And finally, mlf-core ensures that you’re always using the deterministic algorithm settings, but also keeps track of the hardware. Therefore we get reproducible machine learning. You can even go one step further and use MLflow together with mlf-core to create PyPI and Conda packages from that. You can then integrate them into more complex pipeline setups such as Nextflow or Airflow, really whatever you want, and therefore get deterministic end-to-end pipelines.
As for future work, what are we going to do? We will be trying to add more templates for even more machine learning libraries that are supported in MLflow. We will try to provide templates.
These would basically give you a simple way to create Python packages and integrate them into more complex pipelines. We will be working on improving the existing templates to provide cloud configurations, hyperparameter optimization and these kinds of things. And we will also try to add a couple of more advanced example mlf-core projects for popular architectures, for example for a GAN or a variational autoencoder, as reference implementations.
Therefore we would like to invite you to join mlf-core. It’s on PyPI, so just install it with pip install mlf-core; it is licensed under the Apache 2 license. Furthermore, join our Discord if you want to ask any questions about deterministic machine learning in general or the mlf-core package, contribute to the code on GitHub if you want to, and take a look at our website, mlf-core.com.
Finally, just a few acknowledgements. I would like to thank Sven Nahnsen, Philipp Hennig and Gisela Gabernet, who were the supervisors of the master thesis on which this is based; Duncan Riach from NVIDIA for a lot of helpful comments; the nf-core project for inspiration; and the de.NBI Cloud for providing me with machines to run those experiments on. Thank you very much.
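For readers who want a concrete picture of the lint checks described in the talk (flagging non-deterministic functions and missing seed calls), the idea can be sketched with Python's standard `ast` module. This is not the mlf-core linter and makes no claim about how it is implemented: `lint_source` and `dotted_name` are hypothetical helpers, and `somelib.nondeterministic_op` is an invented name standing in for any function without a deterministic implementation.

```python
import ast


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def lint_source(source, required, banned):
    """Statically check source code without running it.

    required: call names that must appear (e.g. seed calls)
    banned:   call names that must not appear (non-deterministic functions)
    """
    first_seen = {}  # call name -> line number of its first occurrence
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name is not None:
                first_seen[name] = min(first_seen.get(name, node.lineno), node.lineno)

    called = set(first_seen)
    problems = [
        f"line {first_seen[name]}: non-deterministic call {name}"
        for name in sorted(banned & called)
    ]
    problems += [
        f"missing required seed call {name}"
        for name in sorted(required - called)
    ]
    return problems


example = """
import random
import somelib
random.seed(0)
somelib.nondeterministic_op(x)
"""

problems = lint_source(
    example,
    required={"random.seed", "torch.manual_seed"},
    banned={"somelib.nondeterministic_op"},  # hypothetical function name
)
for problem in problems:
    print(problem)
```

Because the check works on the syntax tree, it never has to import or execute the project, which is what makes it cheap enough to run on every commit in continuous integration. A name-based check like this misses aliased imports, so it is only a rough illustration of the principle.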
University of Tübingen / Quantitative Biology Center Tübingen
Lukas Heumos is a research software engineer, with degrees in Bioinformatics, at the Quantitative Biology Center, Tübingen. As part of his scientific efforts he conducts research in reproducible bioinformatics workflows. Based on these experiences he now leads the endeavor of enabling deterministic and even replicable machine learning with mlf-core.
The passionate open-source contributor and hackathon enthusiast was awarded the 2019 University of Tübingen award for exceptional student commitment and was accepted as a Lindau Nobel Laureate Young scientist for the 70th Lindau Nobel Laureate Meeting.