Project Lead/Architect, Adobe, Inc.
I am a Project Lead/Architect on the Unified Profile team in the Adobe Experience Platform. It is a petabyte-scale store with a strong focus on millisecond latencies and analytical capabilities, and easily one of Adobe’s most challenging SaaS projects in terms of scale. I am actively designing and implementing the interactive segmentation capabilities, which help us segment over 2 million records per second using Apache Spark. I look for opportunities to build new features using interesting data structures and machine learning approaches. In a previous life, I was an ML Engineer on the Yelp Ads team, building models for snippet optimization.
Processing large amounts of data for analytical or business use cases is a daily occurrence for Apache Spark users. Cost, latency, and accuracy are the three sides of a triangle that a product owner has to trade off. When dealing with TBs of data a day and PBs of data overall, even small efficiencies have a major impact on the bottom line. This talk covers practical applications of the following four data structures, which help in designing an efficient large-scale data pipeline while keeping costs in check.
- Bloom Filters
- HyperLogLog
- Count-Min Sketches
- T-digests (Bonus)
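As a rough illustration of the first structure in this list, here is a minimal Bloom filter sketch in plain Python. It is not the implementation used in the talk, which targets Scala and Apache Spark. The class name, the `expected_items` and `false_positive_rate` parameters, and the user IDs are all hypothetical. The sketch assumes string keys and derives its bit positions from a single SHA-256 digest.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing formulas: m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: combine two 64-bit halves of one digest as h1 + i * h2.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force h2 to be odd (non-zero)
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical data: 1000 user IDs that have been shown an ad.
seen_ad = BloomFilter(expected_items=1000, false_positive_rate=0.01)
for user_id in range(1000):
    seen_ad.add(f"user-{user_id}")

print(seen_ad.might_contain("user-42"))    # True: added items are always found
print(seen_ad.might_contain("user-5000"))  # usually False for an ID never added
```

A membership check touches only a handful of bits, so memory use stays fixed no matter how long the keys are. The price is a small, configurable chance of a false positive. In Spark itself, a built-in such as `DataFrame.stat.bloomFilter` would likely be the more practical route.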
We will take the fictional example of an eCommerce company, Rainforest Inc, and try to answer the following business questions with these probabilistic data structures and Apache Spark, without writing any SQL.
- Has user John seen an ad for this product yet?
- How many unique users bought items A, B, and C?
- Who are the top sellers today?
- What's the 90th percentile of the cart prices? (Bonus)
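To show how one of these questions could map onto a sketch, here is a minimal Count-Min Sketch in plain Python applied to the "top sellers" question. It is an illustrative sketch only, not the session's code. The `width` and `depth` defaults, the seller names, and the sales counts are made up. A Count-Min Sketch cannot enumerate its keys, so this example assumes the candidate sellers are known; a real pipeline would typically keep a small bounded heap of candidates alongside the sketch.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter that never underestimates a count."""

    def __init__(self, width: int = 2048, depth: int = 5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, obtained by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


# Hypothetical stream of sale events, one seller name per sale.
sales = ["seller_a"] * 50 + ["seller_b"] * 30 + ["seller_c"] * 5

cms = CountMinSketch()
for seller in sales:
    cms.add(seller)

# Rank the known candidates by their estimated sale counts.
ranked = sorted(set(sales), key=cms.estimate, reverse=True)
for seller in ranked:
    print(seller, cms.estimate(seller))
```

Widening the table reduces overcounting from collisions, and adding rows reduces the chance that every row collides at once. In Spark, `DataFrame.stat.countMinSketch` offers a built-in alternative.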
We will dive into how each of these data structures is calculated for Rainforest Inc and see which operations and libraries help us achieve our results. The session will simulate a TB of data in a (streaming) notebook and will include code samples showing effective use of the techniques described to answer the business questions listed above. For the implementation, we will write the functions as Structured Streaming Scala components and serialize the results so that they can be queried separately to answer our questions. We will also present the cost and latency efficiencies achieved at the Adobe Experience Platform, running at PB scale, by using these techniques, to show that this goes beyond theory.
Deploying modern machine learning applications can require a significant amount of time, resources, and experience to design and implement, which introduces overhead for small-scale machine learning projects. In this tutorial, we present a reproducible framework for quickly jumpstarting data science projects using Databricks and Azure Machine Learning workspaces, one that enables easy production-ready app deployment, for data scientists in particular. Although the example presented in the session focuses on deep learning, the workflow can be extended to traditional machine learning applications as well. The tutorial will include sample code with templates, a recommended project organization structure and tools, and key learnings from our experience deploying machine learning pipelines into production and distributing a repeatable framework within our organization.
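The idea of computing a sketch per batch, serializing it, and querying it later can be illustrated with a minimal HyperLogLog in plain Python. This is a sketch under stated assumptions, not the Structured Streaming Scala code from the session. The class, its `precision` parameter, the `to_bytes`/`from_bytes` helpers, and the user IDs are hypothetical. It uses the basic HyperLogLog estimator with a linear-counting fallback for small cardinalities and omits the bias corrections of later variants.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter with mergeable, fixed-size state."""

    def __init__(self, precision: int = 12):
        self.p = precision
        self.m = 1 << precision        # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        if self.p != other.p:
            raise ValueError("precision mismatch")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)    # constant valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # linear counting
        return raw

    def to_bytes(self) -> bytes:
        return bytes(self.registers)             # every rank fits in one byte

    @classmethod
    def from_bytes(cls, payload: bytes) -> "HyperLogLog":
        hll = cls(len(payload).bit_length() - 1)
        hll.registers = list(payload)
        return hll


# Two hypothetical micro-batches of buyers with overlapping user IDs.
batch_1, batch_2 = HyperLogLog(), HyperLogLog()
for i in range(0, 12000):
    batch_1.add(f"user-{i}")
for i in range(8000, 20000):
    batch_2.add(f"user-{i}")

# Each batch's sketch can be stored as bytes and merged at query time.
merged = HyperLogLog.from_bytes(batch_1.to_bytes()).merge(batch_2)
print(round(merged.count()))  # an estimate close to the 20000 distinct users
```

Merging takes the element-wise maximum of the registers, so the combined sketch is identical to one built over the union of both batches. That property is what allows stored sketches to answer distinct-count questions without rescanning the raw data.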
What you will learn:
- Understand how to develop pipelines for continuous integration and deployment within Azure Machine Learning using Azure Databricks.
- Learn how to execute Apache Spark jobs using Databricks Connect and integrate source code with Azure DevOps for version control.
- Gain exposure to using Apache Spark and Koalas for extracting and preprocessing data for modeling.
- Get hands-on experience building deep learning models for time series classification.
- Address challenges of the ML lifecycle by implementing MLflow for tracking model parameters/results, packaging code for reproducibility, and deploying models.
Prerequisites:
- Microsoft Azure Account, Azure Machine Learning Workspace
- Azure DevOps configured
- Pre-register for a Databricks Standard trial (runtime > 6.0)
- Python 3.7.1 virtual environment with the following libraries:
  - databricks-connect==6.1.
- More will be added later.
- Basic knowledge of Python
- Apache Spark
- Basic understanding of deep learning concepts