James Nguyen - Databricks

James Nguyen

Principal Cloud Solution Architect, Microsoft

James Nguyen is a Principal Cloud Solution Architect in ML and Big Data in Microsoft's Azure Customer Success Unit. James has more than 14 years of experience implementing large-scale enterprise solutions and building digital platforms at Microsoft, at start-ups, and at the consulting companies he worked for. He earned his master's degree in Data Science from UC Berkeley. His current focus is implementing Big Data and ML operationalization for Microsoft's customers. At Microsoft, James has worked on some of the biggest Spark implementations (T-Mobile, Starbucks, Providence) together with Databricks.


Leveraging Apache Spark for Large Scale Deep Learning Data Preparation and Inference

Summit 2020

While it is well known that training a Deep Learning model requires lots of data to produce good results, rapidly growing business data often requires a deployed Deep Learning model to process larger and larger datasets. It is not uncommon nowadays for Deep Learning practitioners to find themselves operating in a big data world. To handle large datasets in training, distributed Deep Learning frameworks were introduced. On the inference side, machine learning models, and particularly deep learning models, are usually deployed as REST API endpoints, and scalability is achieved by replicating the deployment across multiple nodes in frameworks such as Kubernetes. These mechanisms usually require a lot of engineering effort to set up correctly and are not always efficient, especially at very large data volumes. In this session, I'd like to present two technical approaches to address the two challenges of Deep Learning in big data:

  1. Parallelize large-volume data preprocessing for structured and unstructured data
  2. Deploy Deep Learning models for high-performance batch scoring in big data pipelines with Spark

Both approaches leverage the latest features and enhancements in the Spark framework and TensorFlow 2.0.
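The batch-scoring approach (item 2) is commonly built on Spark's `mapInPandas`, where each partition's record batches are streamed through a function that loads the model once and scores in bulk, instead of calling a REST endpoint per record. Below is a minimal stand-alone sketch of that iterator-of-batches contract; it is an assumption-laden illustration, not the speaker's actual code: plain Python stands in for Spark, and `StubModel` stands in for a real TensorFlow 2.0 model load (e.g. `tf.keras.models.load_model`).

```python
from typing import Iterator, List

class StubModel:
    """Stand-in for an expensive-to-load Deep Learning model."""
    def predict(self, batch: List[float]) -> List[float]:
        # Placeholder for model.predict on a whole batch at once.
        return [2.0 * x + 1.0 for x in batch]

def score_partition(batches: Iterator[List[float]]) -> Iterator[List[float]]:
    """Mirrors the mapInPandas contract: an iterator of batches in,
    an iterator of scored batches out. The model is loaded ONCE per
    partition and amortized over all of its batches, which is what
    makes Spark batch scoring efficient compared to per-row calls."""
    model = StubModel()  # one load per partition, not per record
    for batch in batches:
        yield model.predict(batch)

# Usage: two batches standing in for Arrow record batches of a partition.
scored = list(score_partition(iter([[1.0, 2.0], [3.0]])))
```

In real Spark code the same function would take and return iterators of `pandas.DataFrame` and be passed to `df.mapInPandas(score_partition, schema=...)`, letting Spark handle partitioning and parallelism.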