Yeshwanth Vijayakumar

Sr Engineering Manager, Adobe, Inc.

I am a Sr Engineering Manager/Architect on the Unified Profile Team in the Adobe Experience Platform. It is a PB-scale store with a strong focus on millisecond latencies and analytical abilities, and easily one of Adobe's most challenging SaaS projects in terms of scale. I am actively designing and implementing the Interactive Segmentation capabilities, which help us segment over 2 million records per second using Apache Spark. I look for opportunities to build new features using interesting data structures and machine learning approaches. In a previous life, I was an ML engineer on the Yelp Ads team building models for snippet optimization.

Past sessions

Summit 2021 Massive Data Processing in Adobe Using Delta Lake

May 26, 2021 03:50 PM PT

At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is a set of complex ingestion flows over a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This powers marketing scenarios that are activated on multiple platforms and channels such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake, and share our experiences.

  • What are we storing?
    • Multi Source - Multi Channel Problem
  • Data Representation and Nested Schema Evolution
      • Performance Trade-Offs with Various Formats
      • Go over anti-patterns used 
        • (String FTW)
    • Data Manipulation using UDFs 
  • Writer Worries and How to Wipe them Away
  • Staging Tables FTW (see the sketch after this list)
  • Datalake Replication Lag Tracking
  • Performance Time!
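Below is a minimal sketch of the staging-table pattern referenced above, assuming hypothetical Delta table paths and a profileId-keyed schema; the real pipeline adds schema evolution, replication-lag tracking, and retries.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

// Hypothetical example: land raw updates into an append-only staging Delta table,
// then periodically MERGE them into the main profile table keyed by profileId.
val spark = SparkSession.builder().appName("staging-merge-sketch").getOrCreate()

// 1. Append the incoming batch to the staging table (cheap, no conflicts with readers).
val incoming = spark.read.json("/mnt/landing/profile-updates/")          // assumed path
incoming.write.format("delta").mode("append").save("/mnt/staging/profile_updates")

// 2. Merge the staged rows into the main table in one transaction.
val mainTable = DeltaTable.forPath(spark, "/mnt/tables/unified_profile") // assumed path
val staged    = spark.read.format("delta").load("/mnt/staging/profile_updates")

mainTable.as("t")
  .merge(staged.as("s"), "t.profileId = s.profileId")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```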

Summit 2021 How Adobe uses Structured Streaming at Scale

May 26, 2021 04:25 PM PT

Adobe's Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment which is used from Ingestion to Processing. We want to share some of our learnings and hard earned lessons and as we reached this scale specifically with Structured Streaming.

 

Know thy Lag

  • While consuming off a Kafka topic that sees sporadic loads, it's very important to monitor the consumer lag. It also makes you respect what a beast backpressure is. (See the monitoring sketch below.)
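One lightweight way to surface lag-related signals is a StreamingQueryListener that logs each micro-batch's Kafka offset ranges; this is a minimal sketch under that assumption, and production setups typically also track consumer-group lag on the Kafka side.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Logs per-batch offset ranges and row counts so lag trends can be charted externally.
class LagLoggingListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    p.sources.foreach { s =>
      // startOffset/endOffset are JSON maps of topic -> partition -> offset.
      println(s"batch=${p.batchId} rows=${s.numInputRows} " +
              s"start=${s.startOffset} end=${s.endOffset}")
    }
  }
}

val spark = SparkSession.builder().getOrCreate()
spark.streams.addListener(new LagLoggingListener)
```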

Reading Data In

  • Fan-out pattern using minPartitions to use Kafka efficiently
  • Overload protection using maxOffsetsPerTrigger
  • More Apache Spark settings used to optimize throughput (see the reader sketch below)
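The two options above are plain Kafka-source settings in Structured Streaming; here is a sketch of a reader using them, with placeholder brokers, topic, and numbers rather than Adobe's actual values.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder
  .option("subscribe", "profile-events")               // placeholder topic
  // Fan out: split Kafka partitions into more Spark input partitions so idle
  // cores can share the deserialization and downstream work.
  .option("minPartitions", "240")
  // Overload protection: cap how many offsets a single micro-batch may pull.
  .option("maxOffsetsPerTrigger", "1000000")
  .load()
```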

MicroBatching Best Practices

  • map() + foreach() vs. mapPartitions() + foreachPartition() (see the sketch below)
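The usual argument for the partition-wise variants is amortizing expensive setup across a whole partition instead of paying it per record; this is a sketch with a hypothetical ExpensiveClient standing in for an HTTP or database connection.

```scala
import org.apache.spark.sql.Dataset

// Hypothetical sink client; stands in for an HTTP/DB connection that is costly to open.
class ExpensiveClient {
  def send(id: String, payload: String): Unit = ()
  def close(): Unit = ()
}
object ExpensiveClient { def connect(): ExpensiveClient = new ExpensiveClient }

case class Event(id: String, payload: String)

def process(events: Dataset[Event]): Unit = {
  events.foreachPartition { (partition: Iterator[Event]) =>
    // Built once per partition, not once per record.
    val client = ExpensiveClient.connect()
    try {
      partition.foreach(e => client.send(e.id, e.payload))
    } finally {
      client.close()
    }
  }
}
```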

Apache Spark Speculation and its Effects

Calculating Streaming Statistics

  • Windowing (see the sketch after this list)
    • Importance of the State Store
    • RocksDB FTW
  • Broadcast joins
  • Custom Aggregators
  • Off-heap Counters using Redis
    • Pipelining
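As a flavor of the windowing piece, here is a minimal sketch of a watermarked, windowed aggregate; the topic, column names, and intervals are placeholders. Open-window state lives in the state store, which can be backed by RocksDB in newer Spark releases via spark.sql.streaming.stateStore.providerClass.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder
  .option("subscribe", "profile-events")               // placeholder topic
  .load()
  .select(
    col("timestamp"),
    get_json_object(col("value").cast("string"), "$.userId").as("userId"))

// One-minute windows of approximate unique users, with a 10-minute watermark
// so old window state can be evicted from the state store.
val perMinuteUsers = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "1 minute"))
  .agg(approx_count_distinct("userId").as("uniqueUsers"))

perMinuteUsers.writeStream
  .outputMode("update")
  .format("console")
  .start()
```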

Summit 2021 Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

May 28, 2021 11:05 AM PT

We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.

Niche 1: Long Running Spark Batch Job - Dispatch New Jobs by Polling a Redis Queue

  • Why?
    • Custom queries on top of a table; we load the data once and query N times
  • Why not Structured Streaming?
  • Working solution using Redis (see the sketch below)
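A minimal sketch of the polling loop, assuming Jedis as the Redis client, a list named query-queue, and a pre-loaded cached table exposed as a temp view; the production version also handles auth, retries, and result delivery.

```scala
import org.apache.spark.sql.SparkSession
import redis.clients.jedis.Jedis

object QueryDispatcher {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("redis-query-dispatcher").getOrCreate()

    // Load once, cache, and expose as a view that incoming queries can target.
    spark.read.format("delta").load("/mnt/tables/unified_profile")   // placeholder path
      .cache()
      .createOrReplaceTempView("profiles")

    val redis = new Jedis("redis-host", 6379)                        // placeholder host
    while (true) {
      // BLPOP blocks up to 30s for a SQL string pushed onto the "query-queue" list.
      val popped = redis.blpop(30, "query-queue")                    // returns [key, value] or null
      if (popped != null) {
        val result = spark.sql(popped.get(1))
        // Write results somewhere the caller can pick them up (placeholder location).
        result.write.mode("overwrite").json(s"/mnt/results/${System.currentTimeMillis()}")
      }
    }
  }
}
```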

 

Niche 2: Distributed Counters

  • Problems with Spark Accumulators
  • Utilize Redis Hashes as distributed counters
  • Precautions for retries and speculative execution
  • Pipelining to improve performance (see the sketch below)
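A sketch of the counter pattern under assumptions: Jedis on the executors, a single hash named ingest:counts, and a placeholder schema. Guards against double counting from retries and speculative execution are deliberately left out here.

```scala
import org.apache.spark.sql.Dataset
import redis.clients.jedis.Jedis

case class Record(tenantId: String)

def countByTenant(records: Dataset[Record]): Unit = {
  records.foreachPartition { (partition: Iterator[Record]) =>
    val jedis = new Jedis("redis-host", 6379)   // placeholder host
    val pipe  = jedis.pipelined()
    try {
      partition.foreach { r =>
        // HINCRBY on one hash keeps all tenant counters under a single key.
        pipe.hincrBy("ingest:counts", r.tenantId, 1L)
      }
      pipe.sync()   // flush the pipeline once per partition instead of per record
    } finally {
      jedis.close()
    }
  }
}
```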


Summit 2020 Everyday Probabilistic Data Structures for Humans

June 24, 2020 05:00 PM PT

Processing large amounts of data for analytical or business cases is a daily occurrence for Apache Spark users. Cost, latency, and accuracy are three sides of a triangle a product owner has to trade off. When dealing with TBs of data a day and PBs of data overall, even small efficiencies have a major impact on the bottom line. This talk covers the practical application of the following four data structures, which help design an efficient large-scale data pipeline while keeping costs in check.

  1. Bloom Filters
  2. HyperLogLog
  3. Count-Min Sketches
  4. T-Digests (Bonus)

We will take the fictional example of an eCommerce company, Rainforest Inc, and try to answer the following business questions with our probabilistic data structures and Apache Spark, without writing any SQL:

  1. Has user John seen an ad for this product yet?
  2. How many unique users bought items A, B, and C?
  3. Who are the top sellers today?
  4. What's the 90th percentile of the cart prices? (Bonus)

We will dive into how each of these data structures is computed for Rainforest Inc and see which operations and libraries help us achieve our results. The session will simulate a TB of data in a notebook (streaming) and will include code samples showing effective use of the techniques described to answer the business questions listed above. For the implementation, we will build the functions as Structured Streaming Scala components and serialize the results so they can be queried separately to answer our questions. We will also present the cost and latency efficiencies achieved at the Adobe Experience Platform, running at PB scale, by utilizing these techniques, to show that this goes beyond theory.
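As a flavor of the APIs involved, here is a minimal sketch using the probabilistic helpers that ship with Spark: a Bloom filter and Count-min sketch via DataFrameStatFunctions, HyperLogLog via approx_count_distinct, and percentile_approx standing in for a t-digest. The tiny in-memory dataset, column names, and thresholds are illustrative, not the talk's actual implementation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Placeholder purchase events for "Rainforest Inc".
val purchases = Seq(
  ("john", "itemA", 12.50), ("mary", "itemB", 80.00), ("john", "itemC", 5.25)
).toDF("userId", "itemId", "cartPrice")

// 1. Bloom filter: "has user John been seen yet?" (false positives possible,
//    no false negatives). 1M expected items, 1% false-positive rate.
val buyers = purchases.stat.bloomFilter("userId", 1000000L, 0.01)
val maybeJohn = buyers.mightContain("john")

// 2. HyperLogLog: approximate count of unique buyers.
val uniqueBuyers = purchases.agg(approx_count_distinct("userId")).first().getLong(0)

// 3. Count-min sketch: approximate per-item purchase counts for "top sellers".
val cms = purchases.stat.countMinSketch("itemId", 0.001, 0.99, 42)
val itemACount = cms.estimateCount("itemA")

// 4. Percentile approximation (bonus): 90th percentile of cart prices.
val p90 = purchases.agg(expr("percentile_approx(cartPrice, 0.9)")).first().get(0)
```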

Adobe's Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment, which is used from ingestion to processing. We want to share some of the learnings and hard-earned lessons from reaching this scale.

  • Repeated Queries Optimization - or the art of how I learned to cache my physical plans. SQL interfaces expose prepared statements; how do we apply the same idea to batch processing?
  • Know thy Join - Joins/Group By are unavoidable when you don't have much control over the data model, but one must know exactly what happens underneath, given the deadly shuffle one might encounter.
  • Structured Streaming - Know thy Lag - While consuming off a Kafka topic that sees sporadic loads, it's very important to monitor the consumer lag. It also makes you respect what a beast backpressure is.
  • Skew! Phew! - Skewed data causes so many uncertainties, especially at runtime. Configs that applied on day zero no longer apply on day 100, so the code must be made resilient to skewed datasets. (See the salting sketch after this list.)
  • Sample Sample Sample - Sometimes the best way to approach a large problem is to eat a small part of it first.
  • Redis - Sometimes the best tool for the job is actually outside your JVM. Pipelining + Redis is a powerful combination to supercharge your data pipeline.
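To illustrate the skew point, here is one common mitigation, key salting, sketched with placeholder tables and column names; it is an illustration rather than the exact technique used in production, and adaptive query execution in Spark 3.x is often the simpler first resort.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()

val facts = spark.read.format("delta").load("/mnt/tables/events")      // skewed on userId (placeholder)
val dims  = spark.read.format("delta").load("/mnt/tables/user_attrs")  // placeholder

val saltBuckets = 16

// Add a random salt to the skewed side so one hot key spreads over many shuffle partitions...
val saltedFacts = facts.withColumn("salt", (rand() * saltBuckets).cast("int"))

// ...and replicate the other side across all salt values so every pair still matches.
val saltedDims = dims.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

val joined = saltedFacts.join(saltedDims, Seq("userId", "salt"))
```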

We will present our war stories and lessons for the above, and hopefully they will benefit the broader community.