Patterns and Anti-Patterns for Memorializing Data Science Project Artifacts


Data science projects involve a variety of artifacts that could potentially be memorialized, versioned, or transitioned to new owners: data sets, ETL code, exploratory analyses and visualizations, experiment configurations, modeling code, and serialized models (among other things). This talk presents a set of principles for organizing these artifacts such that:

  • Project work is reproducible and easily transferred from one data scientist to another
  • Important modeling and ETL decisions are recorded and explained
  • Code organization is transparent and supports review practices
  • The needs of production deployment are anticipated
  • In the long term, the data science process is accelerated

Many of these principles are aligned with the design of particular toolchains for data science, such as MLflow, but all can be implemented using widely-used open-source tooling. Included in this presentation will be:

  • Options for storing or referencing modeling datasets and other artifacts
  • Treatment of PII and other sensitive information in project organization
  • Common anti-patterns for data science project organization, and their consequences

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Hi everybody, thanks for joining today’s session. I’m Derrick Higgins. I’m Senior Director of our Enterprise Data Science team at Blue Cross / Blue Shield of Illinois, Montana, New Mexico, Oklahoma, and Texas. I’m here with my colleague Sonjia Waxmonsky, who’s a senior data scientist with us. We’d like to talk with you a little bit today about some practices for storing artifacts that are associated with data science projects. So we’ve worked at a few different organizations around the data science space, some smaller organizations, some bigger organizations, and we’ve seen how teams tend to organize their projects in terms of the artifacts associated with them. Some good practices and a lot of practices that are maybe not good ideas to replicate.

Before we go into that, though, let me tell you a little bit more about us as an organization. So we’re Blue Cross / Blue Shield of Illinois, Montana, New Mexico, Oklahoma, and Texas, so we’re one of the Blues. These are health insurers who do business under the Blue Cross / Blue Shield trademark. But we’re a little bit unique among these Blues in a couple of ways. For one thing, we’re a pretty big health insurance company. We’re the 4th largest private insurer in the United States. We insure 1 in 12 American adults. But also because we are a not-for-profit insurer. So we are owned by our members, so we certainly have a financial incentive to control costs and ensure that we use our members’ premiums wisely. But we are also aligned with our members’ interests in other ways, ensuring that they get good access to care and support in the communities in which they live.

Okay. So the first type of pattern I wanna talk a little bit about is one in which we see there’s really not even one place where the whole project lives. There’s really not even a central repository or starting point for looking at what constitutes the data science project.
So, some of you have probably had conversations that go a little bit like this where you’ve got some team that you work with.

Code fragmentation

Maybe you do modeling, and this team works on data pipelines and provides input to you. And one day you see that things have changed in some way that is surprising to you. So you reach out to the person on the team who’s responsible for it and you say, “Can you tell me a little bit about whether something has changed with this variable? It seems kind of different.” And they’ll say, “Well, we did make some changes. We’re always making changes. But nothing that we think should be of any consequence to you.” “Okay, but can you tell me exactly what happened? Exactly what the changes were?” And then maybe if you’re lucky, you can get them to email you a copy of the code or a snapshot of the code as it exists today.

Or maybe you have a similar conversation on the other side where you have a team that is consuming the outputs of your model, and they’re showing them to the users in some way. But the display isn’t exactly what you expect. So you reach out, say, “Can you tell me exactly how you are rolling these scores up to be shown in your dashboard?” “Well, it’s the average.” “Okay, well, the average, but is it the average of all scores in a day, or average by members in a day?” “Uh, not sure. Let me get back to you when I get the right answer for that.”

So these are both kind of frustrating conversations, right? Because there’s some dependency between your project or your code and other teams, but you can’t get immediate ground truth or you can’t get transparency about how your code is operating in its larger context.

This will not be surprising to those of you who are familiar with Conway’s Law. So Conway’s Law says, “Any organization that designs a system, defined broadly, will produce a design whose structure is a copy of the organization’s communication structure.” And in data science, often our teams look something like this, where there’s a data science team that does the modeling and some other things.
There’s an adjacent data team that develops data pipelines. There’s a front-end team or an application team that takes the outputs of the model and embeds them in some user base in the application. And maybe there’s an infrastructure team that supports all these teams. And if you have four teams collaborating together on this project, you’re gonna end up with four different ___ And this leads to all sorts of problems as we’ve seen. We’ve got the inefficient way of communicating between teams with this manual interface where you gotta reach out by Slack or by email or whatever the tooling is that your company prefers. It’s hard to reproduce results end-to-end that involve the pieces that were developed by different teams. In general, big challenges for governance and for quality assurance.
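The ambiguity in “it’s the average” from the dashboard conversation above is easy to make concrete. The following is a minimal sketch with made-up scores for a single day; the member IDs, values, and function names are all hypothetical. It shows that the two readings of “average” can give quite different numbers, which is why the exact roll-up logic needs to live somewhere it can be read.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (member_id, score) rows for a single day.
scores = [
    ("member_a", 0.9),
    ("member_a", 0.7),
    ("member_a", 0.8),
    ("member_b", 0.2),
]


def average_of_all_scores(rows):
    """Average every score row, so frequently scored members weigh more."""
    return mean(score for _, score in rows)


def average_by_member(rows):
    """Average each member's scores first, then average the per-member means."""
    per_member = defaultdict(list)
    for member_id, score in rows:
        per_member[member_id].append(score)
    return mean(mean(values) for values in per_member.values())


print(round(average_of_all_scores(scores), 2))  # 0.65
print(round(average_by_member(scores), 2))      # 0.5
```

With these made-up rows the first reading gives about 0.65 and the second about 0.50, because one member was scored three times.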

Starting from this situation, we wanna move away from it. We wanna move to a situation where we have more transparency. We can understand what’s going on in other parts of the organization. Where there’s some versioning of code and we can state our dependencies in terms of the versions in other repositories that we depend on. And there’s, you know, maximum interoperability between the groups. But it’s not easy to do that, especially in larger organizations. For one thing, different teams can have different tool chains. So if one team is working in Python and the front-end team is working in JavaScript, it can be hard to sort of negotiate that. Especially given that different teams may have different levels of technical sophistication. So even if it’s possible in principle, it may not be easy for some teams to just adopt a different language or adopt a different set of tools to do their job once they’ve already got a process in place. There may be teams that don’t write code at all. They really just use GUIs or manual processes to produce their outputs. So that can obviously make it very difficult to have their process codified in a place that’s transparent. And, of course, even if we can solve all of these technical issues, there are other reasons why it might be difficult for teams to converge around a more transparent solution. You know, some teams may not have incentives to actually move towards transparency.

So, if we are able to get alignment about trying to move towards this goal, there are different tactics that can be applied in terms of how to implement it.


We might decide, “Well, let’s take the code of all of these different teams, put it in one place, put it in one shared repository, for example, a GitHub repository.” But that can be difficult to do, especially in large matrixed organizations that have a certain degree of siloing. One approach that’s maybe a step short of that is to use looser coupling: instead of having one common repository, have multiple repositories, one still associated with every team, but at least with a versioning scheme in place for each of those teams so that they can then be linked. That may also not be possible in some organizations. But at least the bare minimum we can shoot for is to have the code live someplace where it can be discovered, someplace that is, you know, public within the organization, and to have some documentation around where that code is and who the person is to contact in case of questions.

Okay, um. Another pattern I wanna discuss is that in which instead of a repository per se, we use a folder on a drive someplace as the location where we store a project and all of its artifacts.

So, some of you out there are probably in managerial roles as I am, and have received emails like this in the past. This is actually an email that I got in the recent past when an employee left our organization. And it basically says, “Hey, this person has left, and they had a bunch of stuff on their computer. Can you please look through this and decide what is important to keep and what can be thrown away?” And unfortunately, this is the way data projects are sometimes, too often, transitioned from one person to another. A person leaves or a person becomes occupied with a new project, and all of their stuff is in a folder someplace that is dumped on the new person so that they can get oriented.

And it’s natural that this should be the case because, you know, a file folder is a way we’ve been storing project artifacts since time immemorial, certainly before the digital era started. This is a set of file folders organized hierarchically on my desk that I use to manage my household management projects. It’s got stuff about my condo board, and stuff about medical issues in my family and stuff about home improvements. So it’s a natural way to store project information. And it’s very explicitly the inspiration for some of the mechanisms we have in place for storing information on the computer today. So this diagram is actually from a paper in 1958 by Barnard & Fein describing a hierarchical file system they had devised for the ERMA Mark 1 system. Maybe, depending on who you believe, the first hierarchical file system for computers. And you can see from the diagram it was very explicitly modeled on paper file folders that we use to store information and to group information according to category.

Problem: Lack of versioning

But when we take this analog metaphor and translate it into the digital world, we adopt a lot of the limitations of hierarchical file folders for project organization unnecessarily. In the digital world, we can do better. So for example, lack of versioning. File folders do not allow for versioning of files, typically. And often, I’m sure some of you have seen projects organized like shown in this screenshot where you’ve got different versions of some script that somebody has written, and because the file system does not directly support versioning, the author has to try to sort of shoehorn some versioning in there by adding suffixes to file names, “.v1,” “.bak,” “.final,” maybe adding the dates to some files to do some sort of ad hoc versioning. And even if we do have a file system that supports some sort of versioning, for example, SharePoint or S3 buckets, often it doesn’t track the stuff you really care about: who made the changes? Why did they make those changes?

Problem: Catastrophic failure

A second problem, of course, is that if you are using a folder on your disk to store everything associated with the project, with your code, your data, all the documentation and so on, it lives and dies with your computer. If your laptop falls off your bike on your commute, if you have a catastrophic system failure, if you leave your laptop unattended someplace you should not, you could potentially lose all your work, and it’s gone.

And the third problem with using file folders for storing project artifacts is that it just isn’t good for collaboration. So if your project data lives on your machine, nobody’s ever gonna find out that it’s there unless you tell them. And even if they do know to ask you for it, there’s not much you can do from a collaboration perspective. Well, you can email them a copy of the code that they can work with. But then when they actually use it and improve it and adapt it to new contexts, there’s not an easy way to get their contributions back into the work that you did. It’s just an irreconcilable fork.

So you may say, “Oh, these are all limitations of using a folder on my local drive to store data, but what about a shared drive? If you use the shared drive maybe that would solve all the problems.” And it solves some of the problems, but I would say it actually makes some other problems even worse. So talking about risk of catastrophic failure, if your code is out there on a shared drive, exposed to the whole company or exposed to everybody on your team, then the more users that are working on it, the more opportunity there is for somebody to unintentionally delete something or to change something in ways that are not intended. And when you’re talking about larger development teams, some of this versioning by convention, adding dates and so on, that’s just gonna break down. It’s not gonna work. Another, I guess, more minor issue is that if you’re using a shared drive for collaboration, you have to be connected to your VPN, to your organization’s intranet, all the time if you wanna work with that code, which can be limiting.

Where to store project files

So, when choosing someplace to store artifacts associated with your repository, it should be versioned; the file store should be resilient so you don’t lose all your stuff; it should be transparent and allow people to find your code or your project artifacts; and it should support collaboration within different teams.

So, you know what’s versioned, and transparent, and supports collaboration? GitHub.

So a lot of teams use GitHub to — as sort of a central portal for storing artifacts associated with their repositories. But you can definitely go overboard. And we’ve seen people go overboard. Once they learn about Git, they realize some of the benefits and they tend to overuse it for things it really wasn’t intended for. Just to give one example here, I’m sure you’re all familiar with TensorFlow — large neural network library supported by Google. Recently I — (clearing throat) cloned a TensorFlow repository onto my machine. If you do that, it’s about 500 MB of data. It took me 10 minutes on my home network to do that. Understandable because it is such a large project with so many branches and contributors. But there are other projects out there on GitHub which are definitely not as complex as TensorFlow, don’t have as many contributions from independent people, but are much larger. So I found one, uh, name obscured here to protect the guilty, that is a relatively simple data science project, but because it contains a lot of image data it’s about 2 GB in size, took like 40 minutes to clone this repository into my local machine on my network.

What belongs in GitHub?

So, in terms of what does belong in GitHub, we definitely wanna put things in there that are code-like: they tend to be small, they change a lot as people incrementally develop them, they’re human-readable. If you show differences between two versions of the file, they’re gonna be things people can actually review and reason about and determine whether they make sense. So examples of that would be things like any kind of code, data pipelines, program code for modeling, configuration files that are text-based, the script that you write to set the hyperparameters for your modeling run. Documentation in the appropriate format, which is Markdown. Things that don’t fit so well in GitHub would be things that are binary files, tend to be very big, don’t change a lot over time. So examples there that clearly don’t fit into a Git repository would be serialized models that are the outputs of our training scripts. The data that we train our machine-learning models on. And then any sort of intermediate files that are generated in the course of running our code, whether it’s transformed versions of our data or compiled executables. And there are always gonna be a few kind of edge cases that we can argue about. I’ve had lots and lots of discussions about notebooks and whether notebooks really belong in GitHub or should be someplace else. They’re kind of like code; you can to some extent get a meaningful diff of two versions of a notebook. But then they also present some of the same challenges that we have with the things on the far right of the slide, where they can be large if they have embedded visualizations and images. They can also potentially contain sensitive data in the output (indistinct). So, you know, we can continue to discuss these edge cases.
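The rule of thumb above (small, text-based, frequently changing files go in Git; large or binary files do not; notebooks are debatable) can be sketched as a simple check. This is only an illustration: the size limit and the extension lists are hypothetical assumptions, not something given in the talk, and a real team would tune them.

```python
from pathlib import PurePosixPath

# Hypothetical threshold and extension lists; tune them for your own projects.
MAX_BYTES = 1_000_000
BINARY_SUFFIXES = {".pkl", ".joblib", ".h5", ".parquet", ".png", ".jpg", ".zip", ".exe"}
TEXT_SUFFIXES = {".py", ".sql", ".md", ".yaml", ".yml", ".json", ".toml", ".cfg"}


def belongs_in_git(path: str, size_bytes: int) -> str:
    """Return 'yes', 'no', or 'discuss' for a candidate file."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix == ".ipynb":
        return "discuss"  # notebooks are the edge case described above
    if suffix in BINARY_SUFFIXES or size_bytes > MAX_BYTES:
        return "no"  # binary or large: better kept in a data or model store
    if suffix in TEXT_SUFFIXES:
        return "yes"  # small, human-readable, diffable
    return "discuss"  # unknown file type: decide case by case


print(belongs_in_git("src/train.py", 4_000))        # yes
print(belongs_in_git("models/model.pkl", 50_000))   # no
print(belongs_in_git("eda/explore.ipynb", 20_000))  # discuss
```

A check like this could run before commits to flag files for review, though the talk does not prescribe any particular enforcement mechanism.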

So when we put things into GitHub that don’t really belong there in terms of our data science project, we run into a few types of problems. One, again, it’s not good for collaboration. When it takes half an hour to clone a Git repository, that’s really sand in the gears. For another thing, it can cause problems with production deployment where if we have a bunch of stuff in our repository that shouldn’t really be there, our repository’s bloated, so that can lead to architecture problems where if we want to, say, deploy that into an AWS Lambda, it may be too big for that. There may be sensitive data in our repositories that isn’t appropriate to put in a production environment. And then finally, there are gonna be integration challenges where if we have some state saved inside the repository, like the state of a data set, that can become inconsistent with other parts of the pipeline that our system runs in. Again, you run into problems. So it’s better to use GitHub for the things it really excels at. And there’s no need to abuse it like this in any case. There really are better solutions. There are some better solutions in the Databricks platform. So, specifically, Delta Lake is a great way to version data sets: you can store different versions of data and access them by timestamp, so you can identify the state of a data set at a given point in time. And then on the compiled model side, or serialized model, there’s the MLflow Model Registry that allows us to train, identify, and then deploy models from our machine-learning code. And there are other solutions as well, such as Git Large File Storage: if we really, really wanna use GitHub as the central entry point for our data science project, then we can link other types of data to that repository with Git Large File Storage. And there are also solutions through cloud providers. S3 and other cloud storage services do support some sort of versioning for larger files, and might be a good complement to what’s available in GitHub. Okay.
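The general idea behind linking large files to a repository, rather than committing them, can be sketched with a small pointer file that records where the data lives and a hash of its contents. This is a hand-rolled illustration under stated assumptions: the function names, the JSON layout, and the example location are hypothetical, and this is not the pointer format used by Git Large File Storage or any other tool.

```python
import hashlib
import json
from pathlib import Path


def write_pointer(data_path: Path, pointer_path: Path, location: str) -> dict:
    """Record where a large file lives plus a SHA-256 digest of its contents."""
    digest = hashlib.sha256()
    with data_path.open("rb") as handle:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    pointer = {
        "location": location,  # hypothetical URI of the external store
        "sha256": digest.hexdigest(),
        "size_bytes": data_path.stat().st_size,
    }
    # The small JSON file is what gets committed; the data itself stays outside Git.
    pointer_path.write_text(json.dumps(pointer, indent=2, sort_keys=True))
    return pointer


def matches_pointer(data_path: Path, pointer_path: Path) -> bool:
    """Check that a local copy of the data is the version the repository expects."""
    pointer = json.loads(pointer_path.read_text())
    return hashlib.sha256(data_path.read_bytes()).hexdigest() == pointer["sha256"]
```

Committing only the pointer keeps the repository small while still pinning the code to one exact version of the data, which addresses the inconsistency problem described above.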
So with that, I’m gonna hand the baton off to my collaborator, Sonjia.

– So Derrick just talked about ways to organize our data science projects, and some of the pitfalls of not taking steps to do this organization in advance. Next, I’m gonna talk about how we use our code and our repos to store an important output of our projects, which is the analytics decisions we make based on our data.

So, what do I mean by analytics decisions? As data scientists, we often write code that has hidden configurations and parameters embedded inside it.

Hidden configurations – Modeling

And this includes things like modeling parameters, filters we put on our population, thresholds, and decisions we make about how we’re going to aggregate and return our data. Oftentimes, these parameters are not given to us by the business. Instead, they come from research that we’ve done on our own data. For example, observing an age distribution to set a threshold. And I think this is really what makes programming for data science different than general software engineering. As data scientists, we often start by exploring our data sets in an ad hoc way and we make discoveries and conclusions. And then we take what we’ve learned to build code that goes into a reproducible pipeline for modeling and scoring.
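One way to keep such a parameter from staying hidden is to give it a name and attach the reasoning to it. The sketch below is only an illustration: the class, the threshold value, the rationale text, and the notebook path are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyticDecision:
    """A parameter together with the reasoning that produced it."""
    name: str
    value: float
    rationale: str  # why this value was chosen
    evidence: str   # where the supporting analysis lives


# Hypothetical example values, used only for illustration.
MIN_AGE = AnalyticDecision(
    name="min_age",
    value=18,
    rationale="Threshold chosen after reviewing the age distribution of the population.",
    evidence="notebooks/eda_age_distribution.ipynb",
)


def filter_population(members, decision=MIN_AGE):
    """Keep members at or above the documented threshold."""
    return [m for m in members if m["age"] >= decision.value]


members = [{"id": 1, "age": 17}, {"id": 2, "age": 18}, {"id": 3, "age": 40}]
print([m["id"] for m in filter_population(members)])  # [2, 3]
```

Compared with a bare `age >= 18` buried in a query, a future reader can see both that a decision was made and where to look to revisit it.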

Discoveries and EDA as a product

So, the analysis that we do can take many forms. It could include standard modeling steps like correlation analysis, or reviewing variables for trends over time and fill rates. Or, we may need to understand data models or learn stories about our data to understand our end users and their problems. Now I can say that I’ve spent many years working on projects, and I’ve inherited code written by other teams, in many different formats. And I’ve learned that it’s necessary to treat this exploration as a first-class work product. This means that it should be documented and memorialized for future readers so that if we need to explain or debug these decisions that we’ve made about our configurations, then we have a starting point for our research. Sometimes I will go back to my own EDA that I’ve done myself six months earlier, and I review what I learned and I’m able to save myself hours of work in understanding a particular problem that I’ve come across. So this is an example of a notebook that was written by a data scientist on one of my projects. Here he was doing correlation analysis, and he saved his work and his conclusions and he wrote a clear description of what he did. Now I have a way to explain why we made the decisions we did on this project, why we dropped particular variables. And I also have a way to rerun what he did if our data set changes. So, as Derrick mentioned, there are different options for doing work in a shared environment. GitHub supports notebook rendering, so we can put these types of notebooks into GitHub, and with this option the notebook is also pinned to our code directly. The Databricks environment also has a shared notebook workspace that has the advantage that teammates can look at each other’s work even if they’re working in different clusters.

So now, let’s talk about the next phase in the data science life cycle, which is building models and storing all the outputs and results of this modeling process. So this is also a step where it’s really important to be cognizant of others who are gonna be working on other projects in the future and who are gonna be inheriting our work.

Modeling outputs – local scope

So usually when we start modeling, we start working in a notebook, so this is an interactive environment. We can debug, we can review, we can visualize the data that we’re returning. But the downside of course of working in a notebook is that it’s only local scope. Whenever we exit, whatever model we built in memory is lost and then all we really have left is the notebook output. So we were able to show that we built a successful model, but when someone comes along and says, “Hey, can you apply that to this new data set? Can you apply it to the 2020 member population?” we don’t really have a way to do that; we have to kind of go back and start from the beginning. So then, okay, so how do we put our models into production? Well, so, this is an example of one way to do it. And here we have a logistic regression equation that’s been pasted into our production SQL code; the coefficients have been pasted into our code. And I think that everyone who works in the insurance industry has had to do something like this at least once. And this actually is fine for a GLM, but as we move more towards machine-learning methods, we do need a different approach.
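For readers who have not seen coefficients pasted into scoring code, here is a minimal sketch of what that amounts to for a logistic regression. The intercept, coefficients, and feature names below are hypothetical placeholders, and the example is in Python rather than SQL purely for illustration.

```python
import math

# Hypothetical coefficients; in practice these are copied from the fitted GLM.
INTERCEPT = -2.0
COEFFICIENTS = {"age_over_65": 1.2, "prior_admissions": 0.5}


def score(features: dict) -> float:
    """Logistic regression: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    linear = INTERCEPT + sum(
        COEFFICIENTS[name] * features.get(name, 0.0) for name in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Linear term: -2.0 + 1.2 * 1 + 0.5 * 2 = 0.2, so p = 1 / (1 + exp(-0.2)).
print(round(score({"age_over_65": 1, "prior_admissions": 2}), 3))  # 0.55
```

The whole model fits in a handful of numbers, which is why this works for a GLM. A model with many trees or network weights has no such compact form, and that is where serialized models come in.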

Modeling tracking with MLflow

So, specifically, we have things that we have to cache like hyperparameters, and we have to serialize our models so they can be applied to future data. We’re still also gonna want to store things like test metrics, possibly store our training data sets, and other outputs that we may have. And storing all of this allows another data scientist or another data science team to pick up where we left off, to take our work and continue the project when it’s needed. So there are different ways to save these types of artifacts. MLflow, for example, has a Python API that can be called directly from the modeling scripts. But what is most important, whatever we’re using, is that we take the time to think about what might be needed in the future to continue this project. Do we just need the final binary? Do we have to save our training data sets, if they’re difficult to curate or depend on when the data was created? It’s also important to allow space for saving these details in the modeling process. To take the time to tag our code, or to use a model registry to call out when we’ve reached a checkpoint model that we want to save and share and possibly apply to other data sets in the future.
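To make the list of things worth saving concrete, here is a minimal standard-library sketch that writes a serialized model next to its hyperparameters and test metrics. It is a stand-in for illustration only and does not reproduce the MLflow API; the function names, file names, and directory layout are assumptions.

```python
import json
import pickle
import time
from pathlib import Path


def save_run(run_dir: Path, model, params: dict, metrics: dict) -> Path:
    """Write the model, its hyperparameters, and its test metrics side by side."""
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "model.pkl").write_bytes(pickle.dumps(model))
    record = {"params": params, "metrics": metrics, "saved_at": time.time()}
    (run_dir / "run.json").write_text(json.dumps(record, indent=2, sort_keys=True))
    return run_dir


def load_run(run_dir: Path):
    """Reload a saved run. Only unpickle files from sources you trust."""
    model = pickle.loads((run_dir / "model.pkl").read_bytes())
    record = json.loads((run_dir / "run.json").read_text())
    return model, record
```

A dedicated tracking tool adds what this sketch lacks, such as a central location, a consistent format across a team, and a way to compare runs, but the set of things being saved is the same.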

So here’s an example of a grid search that was modeled with MLflow. And what you see is that I have a way to review all our experiments, compare the hyperparameters and the model outputs. And the results are stored in a central location where they can be easily shared in a consistent format with other data scientists.

So of course model parameters or model hyperparameters are not the only levers and offsets we need to adjust in our project. When we apply a model in production, we may need to adjust things like date ranges, or point to different file stores if we’re moving from a development to a production environment. So we’ll have a lot of different parameters that we might need to adjust going forward. There are a number of different tools and ways that we can do this type of customization. But generally what we kind of wanna do is pull all these different parameters out and put them in a central location so it’s easy to see what might need to be edited in the future. And so that we can do it without having to go back to our code repo and make code changes to support this. So one kind of straightforward and universally understood way to do this is with command-line arguments. We can also use YAML or (indistinct) config files, and using config files allows us to have multiple concurrent configurations, if we’re possibly working on different machines. But whatever we do, it’s important to think about what might happen. Of course, obviously, we can’t anticipate everything that might need to be changed. That’s one of the realities of giving your models to an end user: they’re always gonna come back with a new way of slicing the data that we hadn’t thought of. But overall, this is kind of why we aim for having well-organized and transparent code. So that these changes can be done fairly quickly. Just as an anecdote, recently at Blue Cross/Blue Shield, we built a COVID-19 risk model. And the first version was done fairly quickly, in about three weeks. And then after we deployed it, we decided to expand what we did to a larger member population. And the second model, one data scientist on our team was able to do that in about three days.
And the reason all that was possible is because we had done our original work in a way that was well-documented, and modular and transparent. And she was able to take the pieces and put them together. So overall, those are the goals that we have for our projects, and this is what we would like to convey: that our projects are well-documented and transparent, and that they put the work that is successful and deployed today on a path where it is easily maintained and can be picked up by others in the future. Okay, so yeah. Thank you for listening to our talk. We hope that these ideas were helpful.
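The configuration approach mentioned earlier in the talk, pulling date ranges and file store locations out of the code into command-line arguments and config files, can be sketched as follows. This is a minimal illustration under stated assumptions: it uses JSON rather than YAML so that it runs with only the standard library, and the parameter names and default values are hypothetical.

```python
import argparse
import json
from pathlib import Path

# Hypothetical defaults for a development environment.
DEFAULTS = {
    "start_date": "2020-01-01",
    "end_date": "2020-12-31",
    "data_root": "/data/dev",
}


def load_settings(argv=None) -> dict:
    """Merge settings with precedence: defaults < config file < command line."""
    parser = argparse.ArgumentParser(description="Score a member population.")
    parser.add_argument("--config", type=Path, help="optional JSON config file")
    parser.add_argument("--start-date")
    parser.add_argument("--end-date")
    parser.add_argument("--data-root")
    args = parser.parse_args(argv)

    settings = dict(DEFAULTS)
    if args.config is not None:
        settings.update(json.loads(args.config.read_text()))
    for key in DEFAULTS:
        value = getattr(args, key)  # argparse maps --start-date to start_date
        if value is not None:
            settings[key] = value
    return settings


print(load_settings(["--data-root", "/data/prod"])["data_root"])  # /data/prod
```

Keeping every adjustable value in one dictionary makes it easy to see what can change between a development run and a production run without reading the modeling code.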

About Derrick Higgins

Blue Cross / Blue Shield of Illinois

Dr. Derrick Higgins is senior director of data science at Blue Cross and Blue Shield of Illinois. His team serves as a center of excellence, facilitating collaboration, providing governance, and assembling data science best practices for the enterprise. He has built and led data science teams at American Family Insurance, Civis Analytics, and the Educational Testing Service. His work has been published in leading conferences and journals in the fields of computational linguistics, speech processing, and language testing, and has resulted in ten patents. He also teaches graduate computer science at the Illinois Institute of Technology.

About Sonjia Waxmonsky

Blue Cross / Blue Shield of Illinois

Sonjia Waxmonsky is a Senior Data Scientist with Health Care Service Corporation (HCSC). She earned a PhD in Computer Science from the University of Chicago in 2011, and from there joined LexisNexis Risk Solutions where she developed one of the first credit-based underwriting models for the life insurance industry. Her work at HCSC covers text mining, call center analytics, and hospital readmissions. Dr. Waxmonsky also has a background in consulting and software development, experiences which she draws on in her role as a data scientist.