Sonjia Waxmonsky is a Senior Data Scientist with Health Care Service Corporation (HCSC). She earned a PhD in Computer Science from the University of Chicago in 2011 and then joined LexisNexis Risk Solutions, where she developed one of the first credit-based underwriting models for the life insurance industry. Her work at HCSC covers text mining, call center analytics, and hospital readmissions. Dr. Waxmonsky also has a background in consulting and software development, experience that she draws on in her role as a data scientist.
Data science projects involve a variety of artifacts that can be memorialized, versioned, or transitioned to new owners: data sets, ETL code, exploratory analyses and visualizations, experiment configurations, modeling code, and serialized models, among other things. This talk presents a set of principles for organizing these artifacts such that:
- Project work is reproducible and easily transferred from one data scientist to another
- Important modeling and ETL decisions are recorded and explained
- Code organization is transparent and supports review practices
- The needs of production deployment are anticipated
- In the long term, the data science process is accelerated

Many of these principles are aligned with the design of particular toolchains for data science, such as MLflow, but all can be implemented using widely used open-source tooling. Included in this presentation will be: