Jose is a software engineer working on the Spark execution engine and Delta Lake. He holds a bachelor’s degree in computer science from Caltech.
May 28, 2021 11:05 AM PT
Change Data Feed is a new feature of Delta Lake on Databricks that is available as a public preview since DBR 8.2. This feature enables a new class of ETL workloads such as incremental table/view maintenance and change auditing that were not possible before. In short, users will now be able to query row level changes across different versions of a Delta table.
In this talk we will dive into how Change Data Feed works under the hood and how to use it with existing ETL jobs to make them more efficient and also go over some new workloads it can enable.
June 4, 2018 05:00 PM PT
This talk will cover the details of Continuous Processing in Structured Streaming and my work implementing the initial version in Spark 2.3 as well as the updates for 2.4. DStreams was Spark's first attempt at streaming, and through dstream Spark became the first framework to provide both batch and streaming functionalities in one unified execution engine.
The way streaming execution happens is through this "micro-batch" model, in which the underlying execution engine simply runs on batches of data over and over again. Dstream's design tightly couples the user-facing APIs with the execution model, and as a result was very difficult to accomplish certain tasks important in streaming, e.g. using event time and working with late data, without breaking the user-facing APIs. Structured Streaming was the 2nd (and the latest) major streaming effort in Spark. Its design decouples the frontend (user-facing APIs) and backend (execution), and allows us to change the execution model without any user API change.
However, the (historical) minimum possible latency for any record for DStreams or Structured Streaming was bounded by the amount of time that it takes to launch a task. This limitation is a result of the fact that the engine requires us to know both the starting and the ending offset, before any tasks are launched. In the worst case, the end-to-end latency is actually closer to the average batch time + task launching time. Continuous Processing removes this constraints and allows users to achieve sub-millisecond end-to-end latencies with the new execution engine.
This talk will take a technical deep dive into its capabilities, what it took to implement, and discuss the future developments.
Session hashtag: #Dev4SAIS