Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka

At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.

In this session, we will discuss how we continuously transform our data infrastructure to support these goals. Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of streaming and batch ETLs and supporting 10x data growth. We will share our experience as early adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty). We will also present a rather unique solution that uses Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services’ costs. Topics include:

  • Kafka and Spark Streaming for stateless and stateful use-cases
  • Spark Structured Streaming as a possible alternative
  • Combining Spark Streaming with batch ETLs
  • “Streaming” over the Data Lake using Kafka

About Itai Yaffe

Itai Yaffe is a Big Data Tech Lead at the Nielsen Marketing Cloud, where he tackles big data challenges using tools such as Spark, Druid, and Kafka. He is also part of the core team of the Israeli chapter of Women in Big Data. Itai is keen on sharing his knowledge and has presented his real-life experience at various forums in the past.