How Adobe Does 2 Million Records Per Second Using Apache Spark!

Deploying modern machine learning applications can require significant time, resources, and experience to design and implement, which introduces overhead for small-scale machine learning projects. In this tutorial, we present a reproducible framework for quickly jumpstarting data science projects using Databricks and Azure Machine Learning workspaces, enabling data scientists in particular to deploy production-ready apps with ease. Although the example presented in the session focuses on deep learning, the workflow can be extended to other traditional machine learning applications as well. The tutorial includes sample code with templates, a recommended project organization structure and tools, and key learnings from our experience deploying machine learning pipelines into production and distributing a repeatable framework within our organization.

What you will learn:

  • Understand how to develop pipelines for continuous integration and deployment within Azure Machine Learning using Azure Databricks.
  • Learn how to execute Apache Spark jobs using Databricks Connect and integrate source code with Azure DevOps for version control (see the Databricks Connect sketch after this list).
  • Gain exposure to using Apache Spark and Koalas for extracting and preprocessing data for modeling (see the Koalas sketch below).
  • Get hands-on experience building deep learning models for time series classification (see the combined Keras/MLflow sketch below).
  • Address challenges of the ML lifecycle by implementing MLflow for tracking model parameters/results, packaging code for reproducibility, and deploying models.
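As a taste of the Databricks Connect workflow above, here is a minimal sketch of submitting Spark work to a remote Databricks cluster from a local Python session. It assumes databricks-connect has already been configured (workspace URL, cluster ID, and token) by running databricks-connect configure; the DataFrame built here is only a placeholder.

    # Minimal sketch: running a Spark job from a local Python session via Databricks Connect.
    # Assumes databricks-connect==6.1 is installed and "databricks-connect configure" has
    # already been run with your workspace URL, cluster ID, and access token.
    from pyspark.sql import SparkSession

    # With Databricks Connect installed, getOrCreate() attaches to the configured
    # remote cluster instead of starting a local Spark context.
    spark = SparkSession.builder.getOrCreate()

    # Placeholder DataFrame; replace with your own dataset.
    df = spark.range(1_000_000).withColumnRenamed("id", "record_id")
    print(df.count())  # the count is computed on the remote Databricks cluster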
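The Koalas item above can be illustrated with a short sketch of pandas-style preprocessing that executes on Spark. The file path and column names are hypothetical placeholders, and the normalization step simply stands in for whatever feature preparation the modeling stage needs.

    # Minimal sketch: pandas-style preprocessing on Spark using Koalas (koalas==0.23.0).
    # The file path and column names below are hypothetical placeholders.
    import databricks.koalas as ks

    # Read data into a Koalas DataFrame (backed by a Spark DataFrame).
    kdf = ks.read_csv("/mnt/data/sensor_readings.csv")

    # pandas-like preprocessing that runs distributed on Spark.
    kdf = kdf.dropna(subset=["sensor_value"])
    mean, std = kdf["sensor_value"].mean(), kdf["sensor_value"].std()
    kdf["sensor_value"] = (kdf["sensor_value"] - mean) / std

    # Hand off a Spark DataFrame to the next stage of the pipeline.
    sdf = kdf.to_spark()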
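For the last two items, a combined sketch: a small Keras 1D-CNN for time series classification whose parameters, metrics, and trained model are logged with MLflow. The architecture, hyperparameters, and random data are placeholders, not the model used in the session.

    # Minimal sketch: a small 1D-CNN for time series classification (keras==2.3.1),
    # with parameters, metrics, and the trained model tracked via MLflow (mlflow==1.4.0).
    # Shapes, hyperparameters, and data below are hypothetical placeholders.
    import numpy as np
    import mlflow
    import mlflow.keras
    from keras.models import Sequential
    from keras.layers import Conv1D, GlobalAveragePooling1D, Dense

    n_timesteps, n_features, n_classes = 128, 1, 3
    X = np.random.rand(256, n_timesteps, n_features)      # placeholder training data
    y = np.random.randint(0, n_classes, size=256)          # placeholder integer labels

    model = Sequential([
        Conv1D(32, kernel_size=5, activation="relu", input_shape=(n_timesteps, n_features)),
        GlobalAveragePooling1D(),
        Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    with mlflow.start_run():
        mlflow.log_param("kernel_size", 5)
        mlflow.log_param("epochs", 5)
        history = model.fit(X, y, epochs=5, batch_size=32, verbose=0)
        mlflow.log_metric("train_accuracy", float(history.history["accuracy"][-1]))
        mlflow.keras.log_model(model, "model")  # packages the model for later deployment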

Prerequisites:

  • Microsoft Azure Account, Azure Machine Learning Workspace
  • Azure DevOps configured
  • Pre-register for a Databricks Standard Trial (runtime > 6.0)
  • Python 3.7.1 virtual environment with the following libraries:
      ◦ databricks-connect==6.1
      ◦ koalas==0.23.0
      ◦ pandas==0.25.3
      ◦ keras==2.3.1
      ◦ mlflow==1.4.0
      ◦ More will be added later.
  • Basic knowledge of Python and Apache Spark
  • Basic understanding of Deep Learning concepts

About Yeshwanth Vijayakumar

Adobe, Inc.

I am a Project Lead/Architect on the Unified Profile Team in the Adobe Experience Platform. It is a PB-scale store with a strong focus on millisecond latencies and analytical capabilities, and easily one of Adobe's most challenging SaaS projects in terms of scale. I am actively designing and implementing the interactive segmentation capabilities that help us segment over 2 million records per second using Apache Spark. I look for opportunities to build new features using interesting data structures and machine learning approaches. In a previous life, I was an ML Engineer on the Yelp Ads team, building models for Snippet Optimizations.