How We Built an Event-Time Merge of Two Kafka-Streams with Spark Streaming


Our team handles the tracking data of otto.de, which arrives in multiple Kafka streams (currently two). We developed a Spark microservice that uses a custom DStream and the concept of a minimum business clock to perform an event-time merge of the two streams. After a short introduction to the problem and the other options we considered, we'll present the solution we developed: we customized the updateStateByKey DStream so that a minimum business clock can be maintained for each partition of the streams. This enabled us to do event-time merges and timeouts for the data from both streams. We will demonstrate our solution and believe it will also be useful for other developers who have to solve the same or similar problems. In addition, we will explain why we used a custom DStream instead of, for example, the built-in windowing mechanism.
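To make the idea concrete, here is a minimal, framework-free Scala sketch of a minimum-business-clock merge for one pair of stream partitions. It is an illustration under assumptions, not the talk's actual code: TrackingEvent, its fields, and the timeout handling are invented for this sketch, and in the real service the logic lives inside the customized updateStateByKey DStream. The sketch assumes each stream delivers events in roughly ascending event-time order, so the smaller of the two streams' newest timestamps marks the point up to which the merged order is final.

    case class TrackingEvent(stream: String, timestampMs: Long, payload: String)

    object BusinessClockMerge {

      // One merge step over the buffered events of both streams. Returns
      // (events safe to emit in event-time order, remaining buffer A,
      //  remaining buffer B).
      def advance(
          bufferA: Vector[TrackingEvent],
          bufferB: Vector[TrackingEvent],
          timeoutClockMs: Long
      ): (Vector[TrackingEvent], Vector[TrackingEvent], Vector[TrackingEvent]) = {
        // Business clock of each stream: the timestamp of its newest event.
        val clockA = bufferA.lastOption.map(_.timestampMs).getOrElse(Long.MinValue)
        val clockB = bufferB.lastOption.map(_.timestampMs).getOrElse(Long.MinValue)
        // The minimum business clock is the slower stream's clock. A silent
        // stream would stall the merge forever, so a timeout clock (e.g. wall
        // clock minus an allowed lag) bounds the wait; this is one reading of
        // the "timeouts" mentioned in the abstract.
        val clock = math.max(math.min(clockA, clockB), timeoutClockMs)
        // Events at or below the clock can no longer be overtaken by an
        // earlier event from the other stream: emit them in event-time order.
        val (readyA, restA) = bufferA.partition(_.timestampMs <= clock)
        val (readyB, restB) = bufferB.partition(_.timestampMs <= clock)
        ((readyA ++ readyB).sortBy(_.timestampMs), restA, restB)
      }
    }

For example, with events at times 1 and 5 buffered from stream A and events at times 2 and 4 from stream B, the minimum business clock is 4: the call emits the events at 1, 2 and 4 in order and keeps the event at 5 buffered, because stream B may still deliver an event between 4 and 5.

    val a = Vector(TrackingEvent("A", 1, "a1"), TrackingEvent("A", 5, "a2"))
    val b = Vector(TrackingEvent("B", 2, "b1"), TrackingEvent("B", 4, "b2"))
    val (merged, restA, restB) = BusinessClockMerge.advance(a, b, timeoutClockMs = 0L)
    // merged: timestamps 1, 2, 4; restA: timestamp 5; restB: empty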

About Sebastian Schröder

Sebastian graduated from the University of Applied Sciences Wedel, Germany, in 2012 with a Master of Science. Since then he has been working as a developer for otto.de, where he helped rebuild the webshop into a vertical, shared-nothing architecture. Two years ago he joined the tracking team, which uses Scala, Kafka, and Spark Streaming to handle the massive number of requests generated by users.

About Ralf Sigmund

Ralf graduated from the University of Hannover, Germany, with a PhD in Biochemistry. Since then he has worked in numerous data-processing scenarios (bioinformatics, federal registers). Three years ago he joined the otto.de tracking team to organize near-real-time processing and analytics.