Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Spark Streaming makes it easy to build scalable, robust stream processing applications — but only once you’ve made your data accessible to the framework. If your data is already in one of Spark Streaming’s well-supported message queuing systems, this is easy. If not, an ad hoc solution to import data may work for a single application, but trying to scale that approach to complex data pipelines integrating dozens of data sources and sinks with multi-stage processing quickly breaks down. Spark Streaming solves the realtime data processing problem, but to build large-scale data pipelines we need to combine it with another tool that addresses data integration challenges.

The Apache Kafka project recently introduced a new tool, Kafka Connect, to make it easier to import data into and export data out of Kafka. This talk will first describe some data pipeline anti-patterns we have observed and motivate the need for a tool designed specifically to bridge the gap between other data systems and stream processing frameworks. We will introduce Kafka Connect, starting with basic usage, its data model, and how a variety of systems can map to this model. Next, we’ll explain how building a tool specifically designed around Kafka allows for stronger guarantees, better scalability, and simpler operationalization compared to other general-purpose data copying tools. Finally, we’ll describe how combining Kafka Connect and Spark Streaming, and the resulting separation of concerns, allows you to manage the complexity of building, maintaining, and monitoring large-scale data pipelines.
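To make the “basic usage” concrete: a Kafka Connect source connector in standalone mode is configured with a small properties file. The sketch below uses the file source connector that ships with Apache Kafka; the connector name, file path, and topic are hypothetical examples, not values from the talk:

```properties
# connect-file-source.properties — a minimal sketch of a standalone
# Kafka Connect source connector (name, file, and topic are illustrative).
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
# Each line appended to this (hypothetical) file becomes a record
# published to the Kafka topic below.
file=/var/log/app/events.log
topic=events
```

A standalone worker can then be started with `bin/connect-standalone.sh config/connect-standalone.properties connect-file-source.properties`, after which a Spark Streaming job only needs to consume the `events` topic — the separation of concerns described above.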

Learn more:

  • Processing Data in Apache Kafka with Structured Streaming in Apache Spark 2.2
  • Real-Time End-to-End Integration with Apache Kafka in Apache Spark’s Structured Streaming
  • Databricks’ Data Pipelines: Journey And Lessons Learned

  • About Ewen Cheslack-Postava

    Ewen Cheslack-Postava is an engineer at Confluent, building a stream data platform based on Apache Kafka to help organizations reliably and robustly capture and leverage all their real-time data. He received his PhD from Stanford University where he developed Sirikata, an open source system for massive virtual environments. His dissertation defined a novel type of spatial query giving significantly improved visual fidelity, and described a system for efficiently processing these queries at scale.