Designing Structured Streaming Pipelines—How to Architect Things Right


Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved.
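
For example, the same DataFrame code runs on static or streaming data. A minimal sketch of that unified API, with a hypothetical JSON source and column names, computing a windowed aggregate with built-in functions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("sketch").getOrCreate()
import spark.implicits._

// Streaming read of JSON events; the path and schema are illustrative.
val events = spark.readStream
  .schema("device STRING, ts TIMESTAMP, value DOUBLE")
  .json("/data/events")

// Same API as batch: average value per device over 10-minute event-time
// windows, tolerating up to 15 minutes of late data.
val avgByDevice = events
  .withWatermark("ts", "15 minutes")
  .groupBy($"device", window($"ts", "10 minutes"))
  .agg(avg($"value").as("avg_value"))
```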

  • What are you trying to consume? Single source? Joining multiple streaming sources? Joining streaming with static data?
  • What are you trying to produce? What is the final output that the business wants? What type of queries does the business want to run on the final output?
  • When do you want it? When does the business need the data? What is the acceptable latency? Do you really need millisecond-level latency?
  • How much are you willing to pay for it? This is the ultimate question, and the answer largely determines how feasible it is to solve the above questions.

These are the questions we ask every customer in order to help them design their pipeline. In this talk, I am going to go through the decision tree for designing the right architecture to solve your problem. The sketches below illustrate how some of these choices map to Structured Streaming code.
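
For instance, "joining streaming with static data" can be expressed directly as a stream-static join. A minimal sketch, assuming a Kafka topic of click events and a hypothetical Parquet table of user profiles (the paths, topic, and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("enrich").getOrCreate()

// Static dimension table (hypothetical path; columns: userId, region, ...).
val profiles = spark.read.parquet("/data/user_profiles")

// Streaming facts from a hypothetical Kafka topic of click events.
val clicks = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "clicks")
  .load()
  .selectExpr("CAST(key AS STRING) AS userId", "CAST(value AS STRING) AS payload")

// Stream-static join: every micro-batch of clicks is joined against the
// static profiles table.
val enriched = clicks.join(profiles, "userId")
```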
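
The latency and cost questions map directly to the query's trigger. A sketch of the two ends of that trade-off, using Spark's built-in rate source and placeholder output paths:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("triggers").getOrCreate()

// The built-in rate source generates (timestamp, value) rows for testing.
val stream = spark.readStream.format("rate").load()

// Low-latency choice: an always-on query firing a micro-batch every minute.
val query = stream.writeStream
  .format("parquet")
  .option("path", "/data/out")
  .option("checkpointLocation", "/chk/out")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

// Cost-conscious alternative: replace the trigger with Trigger.Once() to
// process all available data in a single batch and then stop, so the job
// can run on a schedule instead of keeping a cluster up around the clock.
```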

     

About Tathagata Das

Tathagata Das is an Apache Spark committer and a member of the PMC. He's the lead developer behind Spark Streaming and currently develops Structured Streaming. Previously, he was a grad student at the UC Berkeley AMPLab, where he conducted research on data-center frameworks and networks with Scott Shenker and Ion Stoica.