Patterns and Anti-patterns for Memorializing Data Science Project Artifacts

Data science projects involve a variety of artifacts that could potentially be memorialized, versioned, or transitioned to new owners: data sets, ETL code, exploratory analyses and visualizations, experiment configurations, modeling code, and serialized models (among other things). This talk presents a set of principles for organizing these artifacts such that:

  • Project work is reproducible and easily transferred from one data scientist to another
  • Important modeling and ETL decisions are recorded and explained
  • Code organization is transparent and supports review practices
  • The needs of production deployment are anticipated
  • In the long term, the data science process is accelerated

Many of these principles are aligned with the design of particular toolchains for data science, such as MLflow, but all can be implemented using widely used open-source tooling. Included in this presentation will be:

  • Options for storing or referencing modeling datasets and other artifacts
  • Treatment of PII and other sensitive information in project organization
  • Common anti-patterns for data science project organization, and their consequences
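
As an illustration of the first topic, one way to reference a modeling dataset without storing it alongside project code is to record its location and a content hash in a small, versionable record. The sketch below is not taken from the talk. It uses only the Python standard library, and the function names (dataset_reference, verify_dataset) and the record fields are hypothetical choices made for this example.

```python
import hashlib


def dataset_reference(path, source_uri, chunk_size=1 << 20):
    """Build a small record that points at a dataset instead of copying it.

    `source_uri` is a free-form string (hypothetical field) saying where the
    authoritative copy lives; the hash is computed from the local file.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as fh:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {"source_uri": source_uri, "sha256": digest.hexdigest(), "bytes": size}


def verify_dataset(path, reference):
    """Return True if the file at `path` still matches the recorded hash."""
    current = dataset_reference(path, reference["source_uri"])
    return current["sha256"] == reference["sha256"]
```

A record like this can be committed with the project so that a new owner can confirm they are working from the same data, while the data itself, including any sensitive fields, stays in its governed location.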

About Derrick Higgins

Blue Cross / Blue Shield of Illinois

Dr. Derrick Higgins is senior director of data science at Blue Cross and Blue Shield of Illinois. His team serves as a center of excellence, facilitating collaboration, providing governance, and assembling data science best practices for the enterprise. He has built and led data science teams at American Family Insurance, Civis Analytics, and the Educational Testing Service. His work has been published in leading conferences and journals in the fields of computational linguistics, speech processing, and language testing, and has resulted in ten patents. He also teaches graduate computer science at the Illinois Institute of Technology.

About Sonjia Waxmonsky

Blue Cross / Blue Shield of Illinois

Sonjia Waxmonsky is a Senior Data Scientist with Health Care Service Corporation (HCSC). She earned a PhD in Computer Science from the University of Chicago in 2011, and from there joined LexisNexis Risk Solutions where she developed one of the first credit-based underwriting models for the life insurance industry. Her work at HCSC covers text mining, call center analytics, and hospital readmissions. Dr. Waxmonsky also has a background in consulting and software development, experiences which she draws on in her role as a data scientist.