Dr. Derrick Higgins is senior director of data science at Blue Cross and Blue Shield of Illinois. His team serves as a center of excellence, facilitating collaboration, providing governance, and assembling data science best practices for the enterprise. He has built and led data science teams at American Family Insurance, Civis Analytics, and the Educational Testing Service. His work has been published in leading conferences and journals in the fields of computational linguistics, speech processing, and language testing, and has resulted in ten patents. He also teaches graduate computer science at the Illinois Institute of Technology.
Data science projects involve a variety of artifacts that could potentially be memorialized, versioned, or transitioned to new owners: data sets, ETL code, exploratory analyses and visualizations, experiment configurations, modeling code, and serialized models (among other things). This talk presents a set of principles for organizing these artifacts such that:

- Project work is reproducible and easily transferred from one data scientist to another
- Important modeling and ETL decisions are recorded and explained
- Code organization is transparent and supports review practices
- The needs of production deployment are anticipated
- In the long term, the data science process is accelerated

Many of these principles are aligned with the design of particular toolchains for data science, such as MLflow, but all can be implemented using widely-used open-source tooling. Included in this presentation will be: