Kaarthik Sivashanmugam

Principal Engineering Manager, Microsoft

Kaarthik works in the AI Platform group at Microsoft. He has expertise in building data and machine learning platforms. In his current role, he is building a distributed deep learning platform to unlock the full potential of the GPU cloud, data and machine learning techniques in addressing complex AI challenges and enabling magical end-user experiences in various Microsoft services. Kaarthik is also involved in making the Azure Machine Learning service the best cloud platform for data scientists and ML engineers.



Infrastructure for Deep Learning in Apache Spark

Summit 2019

In machine learning projects, the preparation of large datasets is a key phase that can be complex and expensive. It was traditionally done by data engineers before the handover to data scientists or ML engineers. The two groups operated in different environments because of the differences in the tools, frameworks and runtimes required in each phase. Spark's support for different types of workloads brought data engineering closer to downstream activities, like machine learning, that depend on the data.

Unifying data acquisition, preprocessing, model training and batch inferencing under a single platform enabled by Spark not only provided a seamless experience across phases and helped accelerate the end-to-end ML lifecycle, but also lowered the total cost of ownership (TCO) of building and managing the infrastructure for those phases. With that, the requirements for shared infrastructure expanded to include specialized hardware like GPUs and support for deep learning workloads as well. Spark can make effective use of such infrastructure, as it integrates with popular deep learning frameworks and supports acceleration of deep learning jobs using GPUs.

In this talk, we share learnings and experiences in supporting different types of workloads in shared clusters equipped for doing deep learning as well as data engineering.

We will cover the following topics:

  • Considerations for sharing the infrastructure for big data and deep learning in Spark
  • Deep learning in Spark in clusters with and without GPUs
  • Differences between distributed data processing and distributed machine learning
  • Multitenancy and isolation in shared infrastructure