This post is the third in a series on Bayesian inference ([1], [2]). Here we illustrate how to use managed MLflow on Databricks to perform and track Bayesian experiments with the Python package PyMC3. The result is a systematic, reproducible experimentation pipeline that can be shared across data science teams, thanks to MLflow's version control and parameter tracking features. The data tracked by MLflow can be accessed through the managed service on Databricks, using either the UI or the API. Data scientists who are not using the managed MLflow service can access their experiments and the associated data through the API. On Databricks, access to the data and the models is managed through the ACLs that MLflow provides. The models can then be easily productionized and deployed through a variety of frameworks.
MLflow is an open-source framework for managing your ML lifecycle. It can be used either as a managed service on Databricks or as a stand-alone deployment built from the open-source libraries. This post primarily deals with experiment tracking, but we will also share how MLflow can help with storing trained models in a central repository and with model deployment. In the context of tracking, MLflow allows you to store parameters, metrics, tags, and artifacts such as the trained models themselves.
This section only applies to the open-source deployment of MLflow, since this is automatically taken care of with the hosted MLflow on Databricks. MLflow has a backend store and an artifact store. As the name indicates, the artifact store holds all the artifacts associated with a model run (such as serialized models and summaries), while everything else, including parameters, metrics, tags, and other run metadata, lives in the backend store. If you are running MLflow locally, you can configure this backend store, which can be a file store or a database-backed store. You can run a tracking server anywhere you choose, as shown below:
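For example, here is a minimal sketch of launching a tracking server with a SQLite backend store and a local artifact root; the database file name, artifact path, host, and port are placeholders you would replace with your own:

```bash
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlflow-artifacts \
  --host 0.0.0.0 \
  --port 5000
```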
You can then specify the tracking server to be the one you set above as:
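For instance, assuming the server above is reachable on localhost at port 5000:

```python
import mlflow

# Point the MLflow client at the tracking server started above
mlflow.set_tracking_uri("http://localhost:5000")
```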
On Databricks, all of this is managed for you, minimizing the configuration time needed to get started on your model development workflow. That said, the following applies to both managed and open-source MLflow deployments. MLflow creates an experiment, identified by an experiment ID, and each experiment consists of a series of runs, each identified by a run ID. Parameters and artifacts are logged per run. The workflow, then, is to set (or create) an experiment, start a run within it, and log the parameters and artifacts associated with that run, as shown below:
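The following is a minimal sketch of this workflow with PyMC3; the experiment name, the model, the artifact file names, and the synthetic data are illustrative placeholders rather than the exact model from the original notebook:

```python
import numpy as np
import arviz as az
import pymc3 as pm
import mlflow

# Synthetic data for illustration only
data = np.random.normal(loc=1.0, scale=2.0, size=200)

# Create (or reuse) an experiment; each call to start_run() adds a run to it
mlflow.set_experiment("bayesian-inference-demo")

with mlflow.start_run():
    with pm.Model():
        mu = pm.Normal("mu", mu=0.0, sigma=10.0)
        sigma = pm.HalfNormal("sigma", sigma=5.0)
        pm.Normal("obs", mu=mu, sigma=sigma, observed=data)
        trace = pm.sample(draws=2000, tune=1000, return_inferencedata=True)

    # Log the sampler settings as run parameters
    mlflow.log_param("draws", 2000)
    mlflow.log_param("tune", 1000)

    # Save the trace and its summary, then log them as run artifacts
    trace.to_netcdf("trace.nc")
    az.summary(trace).to_csv("trace_summary.csv")
    mlflow.log_artifact("trace.nc")
    mlflow.log_artifact("trace_summary.csv")
```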
Once the experiment has completed, you can go back and inspect the MLflow UI or extract the run information programmatically. For example, if the current experiment ID is '10618537', you can extract the information about the experiment:
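A sketch of retrieving the experiment metadata through the tracking API, using the experiment ID from the example above:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment = client.get_experiment("10618537")

# Inspect the experiment's metadata
print(experiment.name)
print(experiment.artifact_location)
print(experiment.lifecycle_stage)
```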
Assuming that you know your experiment ID, you can search for all the runs within that experiment and extract the data stored for each run, as indicated below:
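One way to do this is with mlflow.search_runs, which returns the runs as a pandas DataFrame with one column per logged parameter and metric; the params.draws and params.tune columns here correspond to the parameters logged in the earlier sketch:

```python
import mlflow

runs = mlflow.search_runs(experiment_ids=["10618537"])

# Each row is a run; logged parameters appear as "params.<name>" columns
print(runs[["run_id", "status", "params.draws", "params.tune"]])
```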
The artifacts associated with this run can be listed as shown below; the file size and path are shown for each file.
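A sketch using MlflowClient.list_artifacts, taking the first run from the search results above:

```python
# Take the run ID of the first run returned by the search
run_id = runs.loc[0, "run_id"]

# Each artifact entry carries its path and file size
for artifact in client.list_artifacts(run_id):
    print(artifact.path, artifact.file_size)
```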
MLflow manages the artifacts for each run; you can view and download them through the UI, or access them through the API. In the example below, we load the trace information and the trace summary from a prior run.
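A minimal sketch, assuming the artifact file names used in the training sketch above ("trace.nc" and "trace_summary.csv"):

```python
import arviz as az
import pandas as pd

# Download the artifacts from the tracking server to the local filesystem
trace_path = client.download_artifacts(run_id, "trace.nc")
summary_path = client.download_artifacts(run_id, "trace_summary.csv")

# Reload the trace and its summary
trace = az.from_netcdf(trace_path)
trace_summary = pd.read_csv(summary_path, index_col=0)
print(trace_summary)
```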
If you run the above, you will notice that the reloaded trace summary contains the same information as before. The parameter estimates loaded from the artifacts, as described by their posterior distributions, now characterize the current model. If desired, one can continue to fit new data to the model by using the currently estimated posteriors as the priors for a future training cycle.
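As a sketch of that idea, the posterior mean and standard deviation from the reloaded summary could parameterize the priors of a new model; note this is a simple moment-matching approximation rather than a full posterior-to-prior transfer, and new_data is a placeholder for the next batch of observations:

```python
import numpy as np
import pymc3 as pm

# Placeholder for a new batch of observations
new_data = np.random.normal(loc=1.0, scale=2.0, size=100)

# Moment-match the new prior on "mu" to the previous posterior estimates
mu_mean = trace_summary.loc["mu", "mean"]
mu_sd = trace_summary.loc["mu", "sd"]

with pm.Model():
    mu = pm.Normal("mu", mu=mu_mean, sigma=mu_sd)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    pm.Normal("obs", mu=mu, sigma=sigma, observed=new_data)
    updated_trace = pm.sample(draws=2000, tune=1000, return_inferencedata=True)
```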
In this post, we have seen how to use MLflow to systematically perform and track Bayesian experiments with PyMC3. The logging and tracking functionality can be accessed through the managed MLflow on Databricks or, for open-source users, through the MLflow UI and API. Models and model summaries can be saved as artifacts, shared, and reloaded into PyMC3 at a later time.
To learn more about managed MLflow for Bayesian experiments, please check out the attached notebook. You can also learn more about Bayesian inference in my Coursera courses.