At Comcast NBCUniversal, Nabeel Sarwar operationalizes machine learning pipelines under the banner of improving customer experience, operations, field, and anything in between. In the process, he oversees data ingest, feature engineering, and the generation and deployment of the AI models. He has a BA in astrophysics from Princeton University.
Our team at Comcast is tasked with operationalizing predictive ML models to improve customer experience. Our goal is to eliminate bottlenecks in the path from model inception to deployment and monitoring. Traditionally, CI/CD manages code and infrastructure artifacts such as container definitions. We want to extend it to support granular traceability, tracking ML models from use case, to feature/attribute selection, versioned datasets, model training code, model evaluation artifacts, model prediction deployment containers, and the sinks to which predictions and outcomes are persisted. Our framework stack enables us to track models from use case to deployment, manage and evaluate multiple models simultaneously in the live yet dark mode, and continue to monitor models in production against real-world outcomes using configurable policies. The technologies and components that drive this vision are:
1. FeatureStore: Enables data scientists to reuse versioned features and review feature metrics by model. Self-service capabilities allow all teams to onboard their event data into the feature store.
2. ModelRepository: Manages metadata about models, including pre-processing parameters (e.g., scaling parameters for features), the mapping to the features needed to execute the model, model discovery mechanisms, etc.
3. Spark on Alluxio: Alluxio provides a universal data plane on top of various under-stores (e.g., S3, HDFS, RDBMS). Apache Spark, with its Data Sources API, provides a unified query language that data scientists use to consume features and create training/validation/test datasets. These datasets are versioned and integrated into the full model pipeline using Ground-Context, discussed next.
4. Ground-Context: This open-source, vendor-neutral data context service enables full traceability across use cases, models, features, model-to-feature mappings, versioned datasets, model training codebases, model deployment containers, and prediction/outcome sinks. It integrates with the FeatureStore, the container repository, and Git to tie together data, code, and run-time artifacts for CI/CD.
Session hashtag: #ML6SAIS
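To make the ModelRepository idea above more concrete, here is a minimal sketch in plain Python of a metadata record that carries a model's feature mapping and scaling parameters, and of how such a record could be used to prepare a model input. Everything in it is an assumption for illustration: the names `ModelRecord`, `ModelRepository`, and `prepare_input`, the example features, and the use of mean/standard-deviation scaling are hypothetical and do not describe Comcast's actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical repository entry: the features a model needs and how to scale them."""
    name: str
    version: str
    feature_names: List[str]                 # ordered mapping to feature-store columns
    scaling: Dict[str, Tuple[float, float]]  # feature -> (mean, std) captured at training time


class ModelRepository:
    """In-memory stand-in for a model metadata service (illustrative only)."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], ModelRecord] = {}

    def register(self, record: ModelRecord) -> None:
        self._records[(record.name, record.version)] = record

    def lookup(self, name: str, version: str) -> ModelRecord:
        return self._records[(name, version)]


def prepare_input(record: ModelRecord, features: Dict[str, float]) -> List[float]:
    """Select the model's features in order and apply the stored standard scaling."""
    vector = []
    for feature in record.feature_names:
        mean, std = record.scaling[feature]
        vector.append((features[feature] - mean) / std)
    return vector


repo = ModelRepository()
repo.register(ModelRecord(
    name="churn",
    version="1.0.0",
    feature_names=["calls_30d", "outage_minutes"],
    scaling={"calls_30d": (2.0, 1.0), "outage_minutes": (30.0, 15.0)},
))

# A feature row may contain more columns than the model uses; extras are ignored.
row = {"calls_30d": 4.0, "outage_minutes": 60.0, "unused_feature": 9.9}
model_input = prepare_input(repo.lookup("churn", "1.0.0"), row)
```

Keeping the scaling parameters next to the feature mapping means the same pre-processing can be replayed at prediction time without consulting the training code, which is one plausible reason to store both in a single record.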
We will discuss what feature engineering is all about, the various techniques to use, and how to scale to 20,000-column datasets using random forests, SVD, and PCA. We also demonstrate how to build a service around these techniques to save time and effort when building hundreds of models. We will share how we did all this using Spark ML to build logistic regression, neural network, and Bayesian network models, among others. Session hashtag: #EUds12
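As a rough illustration of the dimensionality-reduction idea mentioned above, the sketch below finds the first principal component of a tiny dataset using power iteration on the covariance matrix. It is written in plain Python rather than Spark ML, so it shows only the underlying technique, not the speakers' pipeline; the function names and the toy data are assumptions made for this example.

```python
import math
from typing import List

Matrix = List[List[float]]


def covariance_matrix(rows: Matrix) -> Matrix:
    """Sample covariance of the columns of `rows` (one observation per row)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_principal_component(rows: Matrix, iterations: int = 100) -> List[float]:
    """Approximate the top eigenvector of the covariance matrix by power iteration."""
    cov = covariance_matrix(rows)
    d = len(cov)
    vec = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting direction
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


# Toy data: the second column is roughly twice the first; the third barely varies.
rows = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.6],
    [4.0, 8.0, 0.5],
]
component = first_principal_component(rows)
```

On this toy data the resulting direction places nearly all of its weight on the first two, strongly correlated columns, which is the sense in which PCA lets many redundant columns be summarized by a few derived ones.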