Full stack data engineer and machine learning scientist with 8+ years working experience in Bing Data Mining, Bing Predicts, Business AI of C+E at Microsoft as an E2E applied scientist and data engineer. The technical areas spread across wide spectrum from Data Mining, Machine Learning, Market Intelligence using both Microsoft propriety technology (Cosmos, Azure Machine Learning) and industry leading Open Source technology (Spark, Databricks).
We demonstrate how to deploy a PySpark based Multi-class classification model trained on Azure Databricks using Azure Machine Learning (AML) onto Azure Kubernetes (AKS) and associate the model to web services. This presentation covers end-to-end development cycle; from training the model to using it in web application. Machine Learning problem formulation The current solutions to detect semantic types of tabular data mostly rely on dictionary/vocabulary, regular expressions and rule-based look up to identify the semantic types. However, these solutions are 1. Not robust to dirty and complex data and 2. Not generalized to diverse data types. We formulate this into a Machine Learning problem by training a multi-class classifier to automatically predict the semantic type for tabular data. Model Training on Azure Databricks We choose Azure Databricks to perform the featurization and model training using PySpark SQL and Machine Learning Library. To speed up the featurization process, we leverage the PySpark Functions (UDF) to register and distribute the featurization functions into UDFs. For the model training, we pick the Random Forests as the classification algorithms and optimize the model hyperparameters using PySpark MLLib. Model Deployment using Azure Machine Learning Azure Machine Learning provided the reusable and scalable capabilities to manager the lifecycle of Machine Learning models. We developed the E2E deployment pipeline on Azure Machine Learning including model preparation, computing initialization, model registration, and web service deployment. Serving as Web Service on Azure Kubernetes Azure Kubernetes provide the fast response and autoscaling capabilities serving model as web service together with the security authorization. We customized the AKS cluster with PySpark runtime to support PySpark based featurization and model scoring. Our model and scoring service are being deployed onto AKS cluster and served as HTTPS endpoints with both key-based and token-based authentication.