Massive Data Processing in Adobe Using Delta Lake

May 26, 2021 03:50 PM (PT)


At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power marketing scenarios that are activated across multiple platforms and channels such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake, and share our experiences.
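
For concreteness, a minimal sketch of the kind of Spark-plus-Delta-Lake ingestion step described above is shown below; the landing-zone path, table location, and use of the mergeSchema option are assumptions for illustration, not details of the actual Adobe pipeline.

```scala
// Minimal sketch of a Spark + Delta Lake ingestion step (illustrative only).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("profile-ingestion-sketch")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// Read an incoming batch of experience events from a hypothetical landing zone.
val batchDF = spark.read.format("parquet").load("/landing/experience-events/")

// Append into a Delta table; mergeSchema lets newly added nested fields
// evolve the table schema instead of failing the write.
batchDF.write
  .format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .save("/delta/unified_profile_events")
```

Appending to a Delta table rather than rewriting it keeps the write path cheap, while the Delta transaction log preserves a consistent view for readers.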

  • What are we storing?
    • Multi-Source – Multi-Channel Problem
  • Data Representation and Nested Schema Evolution
    • Performance Trade-Offs with Various Formats
      • Go over anti-patterns used 
        • (String FTW)
    • Data Manipulation using UDFs 
  • Writer Worries and How to Wipe them Away
  • Staging Tables FTW (see the sketch after this list)
  • Datalake Replication Lag Tracking
  • Performance Time!
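
As a rough illustration of the staging-table pattern called out above, the sketch below lands incoming batches in a staging Delta table and then folds them into the main table with a single MERGE. The paths, the identityId join key, and the batch source are hypothetical, and the target is assumed to already exist as a Delta table.

```scala
// Hypothetical staging-table sketch: writers append to a staging Delta table,
// and a background job periodically MERGEs staged rows into the main table.
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("staging-merge-sketch")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

val stagingPath = "/delta/profile_staging"
val targetPath  = "/delta/unified_profile"

// 1. Writers only append to the staging table, so they never contend with
//    each other (or with readers) on the main table.
val incomingBatchDF = spark.read.format("parquet").load("/landing/profile-updates/")
incomingBatchDF.write.format("delta").mode("append").save(stagingPath)

// 2. A periodic job merges the staged rows into the main table, keyed on a
//    profile identity id; matched rows are updated, new ones inserted.
val staged = spark.read.format("delta").load(stagingPath)
DeltaTable.forPath(spark, targetPath)
  .as("t")
  .merge(staged.as("s"), "t.identityId = s.identityId")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```

One motivation for a pattern like this is to keep the expensive MERGE off the hot write path, so ingestion stays fast and concurrent writers do not conflict on the main table.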
In this session watch:
Yeshwanth Vijayakumar, Sr Engineering Manager, Adobe, Inc.


Yeshwanth Vijayakumar

I am a Sr. Engineering Manager/Architect on the Unified Profile Team in the Adobe Experience Platform; it’s a PB-scale store with a strong focus on millisecond latencies and analytical abilities and ...