James Nguyen is a Principal Cloud Solution Architect at Microsoft’s Azure Customer Success Organization. He has a master’s degree in Data Science from UC Berkeley California. He mainly focuses on Big Data and Machine Learning. James has delivered multiple successful large scale implementations in advanced analytics for Microsoft’s strategic customers. He is passionate about scaling Data Science with the power of Apache Spark.
To scale out deep learning training, a popular approach is to use Distributed Deep Learning Frameworks to parallelize processing and computation across multiple GPUs/CPUs. Distributed Deep Learning Frameworks work well when input training data elements are independent, allowing parallel processing to start immediately. However preprocessing and featurization steps, crucial to Deep Learning development, might involve complex business logic with computations across multiple data elements that the standard Distributed Frameworks cannot handle efficiently. These preprocessing and featurization steps are where Spark can shine, especially with the upcoming support in version 3.0 for binary data formats commonly found in Deep Learning applications. The first part of this talk will cover how Pandas UDFs together with Spark’s support for binary data and Tensorflow’s TFRecord formats can be used to efficiently speed up Deep Learning’s preprocessing and featurization steps. For the second part, the focus will be techniques to efficiently perform batch scoring on large data volume with Deep Learning models where real-time scoring methods do not suffice. Upcoming Spark 3.0’s new Pandas UDFs’ features helpful for Deep Learning inference will be covered.