
Introducing the DLT Sink API: Write Pipelines to Kafka and External Delta Tables

Published: February 17, 2025

Product · 7 min read

Summary

  • Data Estate Integration: New DLT Sinks allow seamless data flow to external systems like Kafka, Event Hubs, and Delta Tables.
  • Easy Configuration: The create_sink API simplifies setup for real-time pipelines with flexible options for Kafka and Delta.
  • Real-Time Use Cases: Examples show how to build pipelines for analytics, anomaly detection, and event-driven workflows.

If you are new to Delta Live Tables, we recommend reading Getting Started with Delta Live Tables before this blog; it explains how you can create scalable and reliable pipelines using Delta Live Tables (DLT) declarative ETL definitions and statements.

Introduction

Delta Live Tables (DLT) pipelines offer a robust platform for building reliable, maintainable, and testable data processing pipelines within Databricks. By leveraging its declarative framework and automatically provisioning optimal serverless compute, DLT simplifies the complexities of streaming, data transformation, and management, delivering scalability and efficiency for modern data workflows.

Traditionally, DLT Pipelines have offered an efficient way to ingest and process data as either Streaming Tables or Materialized Views governed by Unity Catalog. While this approach meets most data processing needs, there are cases where data pipelines must connect with external systems or need to use Structured Streaming sinks instead of writing to Streaming Tables or Materialized Views.

The introduction of the new Sinks API in DLT addresses this by enabling users to write processed data to external event streams such as Apache Kafka and Azure Event Hubs, as well as to external Delta tables. This new capability broadens the scope of DLT pipelines, allowing for seamless integration with external platforms.

These features are now in Public Preview, and we will continue to add more sinks from Databricks Runtime to DLT over time, eventually supporting them all. The next one we are working on is foreachBatch, which enables customers to write to arbitrary data sinks and perform custom merges into Delta tables.

The Sink API is available in the dlt Python package and can be used with create_sink() as shown below:
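The snippet below is a minimal, illustrative sketch; the sink names, broker address, topic, and table name are placeholders:

import dlt

# Define a Kafka sink: (sink name, format, options).
dlt.create_sink(
    "my_kafka_sink",
    "kafka",
    {
        "kafka.bootstrap.servers": "broker-1:9092",  # placeholder broker address
        "topic": "my_topic",                         # placeholder topic name
    },
)

# Define a Delta sink targeting a Unity Catalog table (the three-level name is a placeholder).
dlt.create_sink(
    "my_delta_sink",
    "delta",
    {"tableName": "my_catalog.my_schema.my_table"},
)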

The API accepts three key arguments that define the sink:

  • Sink Name: A string that uniquely identifies the sink within your pipeline. This name allows you to reference and manage the sink.
  • Format Specification: A string that determines the output format, with support for either "kafka" or "delta".
  • Sink Options: A dictionary of key-value pairs, where both keys and values are strings. For Kafka sinks, all configuration options available in Structured Streaming can be leveraged, including settings for authentication, partitioning strategies, and more. Please refer to the docs for a comprehensive list of Kafka-supported configuration options. Delta sinks offer a simpler configuration by allowing you to either define a storage path using the path attribute or write directly to a table in Unity Catalog using the tableName attribute.

Writing to a Sink

The @append_flow API has been enhanced to allow writing data into target sinks identified by their sink names. Traditionally, this API allowed users to seamlessly load data from multiple sources into a single streaming table. With the new enhancement, users can now append data to specific sinks too. Below is an example demonstrating how to set this up:
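As a hedged sketch (the source table my_source_table and the sink my_kafka_sink from the previous snippet are illustrative names):

import dlt
from pyspark.sql.functions import to_json, struct

# Append rows from an upstream streaming table into the sink named "my_kafka_sink".
@dlt.append_flow(name="my_kafka_flow", target="my_kafka_sink")
def my_kafka_flow():
    # Kafka sinks expect a string or binary "value" column, so serialize each row as JSON.
    return (
        dlt.read_stream("my_source_table")
        .select(to_json(struct("*")).alias("value"))
    )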

Building the pipeline

Let us now build a DLT pipeline that processes clickstream data included in the Databricks sample datasets. This pipeline will parse the data to identify events linking to an Apache Spark page and subsequently write this data to both Event Hubs and Delta sinks. We will structure the pipeline using the Medallion Architecture, which organizes data into different layers to enhance quality and processing efficiency.

We start by loading raw JSON data into the Bronze layer using Auto Loader. Then, we clean the data and enforce quality standards in the Silver layer to ensure its integrity. Finally, in the Gold layer, we filter entries with a current page title of Apache_Spark and store them in a table named spark_referrers, which will serve as the source for our sinks. Please refer to the Appendix for the complete code.

Configuring the Azure Event Hubs Sink

In this section, we will use the create_sink API to establish an Event Hubs sink. This assumes that you have an operational Kafka or Event Hubs stream. Our pipeline will stream data into Kafka-enabled Event Hubs using a shared access signature (SAS) policy, with the connection string securely stored in Databricks Secrets. Alternatively, you can use a service principal for the integration instead of a SAS policy. Ensure that you update the connection properties and secrets accordingly. Here is the code to configure the Event Hubs sink:
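The sketch below assumes a Kafka-enabled Event Hubs namespace and a Databricks secret holding the SAS connection string; the namespace, topic, and secret scope/key names are placeholders:

import dlt

# Placeholder namespace, topic, and secret scope/key names.
EH_NAMESPACE = "my-eventhubs-namespace"
EH_TOPIC = "clickstream-events"
EH_CONNECTION_STRING = dbutils.secrets.get(scope="my-scope", key="eh-connection-string")

eventhubs_options = {
    # Event Hubs exposes a Kafka-compatible endpoint on port 9093.
    "kafka.bootstrap.servers": f"{EH_NAMESPACE}.servicebus.windows.net:9093",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": "PLAIN",
    # Authenticate with the SAS connection string ("$ConnectionString" is the literal username).
    "kafka.sasl.jaas.config": (
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{EH_CONNECTION_STRING}";'
    ),
    "topic": EH_TOPIC,
}

dlt.create_sink("eventhubs_sink", "kafka", eventhubs_options)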

Configuring the Delta Sink

In addition to the Event Hubs sink, we can utilize the create_sink API to set up a Delta sink. This sink writes data to a specified location in the Databricks File System (DBFS), but it can also be configured to write to an object storage location such as Amazon S3 or ADLS.

Below is an example of how to configure a Delta sink:
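A sketch of both variants; the path and table name are placeholders:

import dlt

# Write to a storage location by path (the path is a placeholder) ...
dlt.create_sink(
    "delta_sink",
    "delta",
    {"path": "/tmp/dlt_sinks/spark_referrers"},
)

# ... or, alternatively, to a Unity Catalog table:
# dlt.create_sink("delta_sink", "delta", {"tableName": "my_catalog.my_schema.spark_referrers_sink"})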

Creating Flows to hydrate Kafka and Delta sinks

With the Event Hubs and Delta sinks established, the next step is to hydrate these sinks using the append_flow decorator. This process involves streaming data into the sinks, ensuring they are continuously updated with the latest information.

For the Event Hubs sink, the value parameter is mandatory, while additional parameters such as key, partition, headers, and topic can be specified optionally. Below are examples of how to set up flows for both the Kafka and Delta sinks:
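A hedged sketch, assuming the gold table spark_referrers from the earlier section and the eventhubs_sink and delta_sink defined above (the column used as the Kafka key is an assumption about that table's schema):

import dlt
from pyspark.sql.functions import to_json, struct, col

# Stream the gold table into the Kafka-enabled Event Hubs sink.
# "value" is mandatory; "key" (as well as partition, headers, and topic) is optional.
@dlt.append_flow(name="kafka_sink_flow", target="eventhubs_sink")
def kafka_sink_flow():
    return (
        dlt.read_stream("spark_referrers")
        .select(
            col("current_page_title").cast("string").alias("key"),
            to_json(struct("*")).alias("value"),
        )
    )

# Stream the same gold table into the Delta sink as-is.
@dlt.append_flow(name="delta_sink_flow", target="delta_sink")
def delta_sink_flow():
    return dlt.read_stream("spark_referrers")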

The applyInPandasWithState function is also now supported in DLT, enabling users to leverage the power of Pandas for stateful processing within their DLT pipelines. This enhancement allows for more complex data transformations and aggregations using the familiar Pandas API. With the DLT Sink API, users can easily stream this stateful processed data to Kafka topics. This integration is particularly useful for real-time analytics and event-driven architectures, ensuring that data pipelines can efficiently handle and distribute streaming data to external systems.
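As a hedged illustration (not from the original pipeline), the sketch below keeps a running click count per page with applyInPandasWithState; the resulting table could be forwarded to a Kafka sink with a flow like the ones above. The source table and column names are assumptions:

import dlt
import pandas as pd
from typing import Iterator, Tuple
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

def count_clicks(
    key: Tuple[str], batches: Iterator[pd.DataFrame], state: GroupState
) -> Iterator[pd.DataFrame]:
    # Carry a running row count for this page title across micro-batches.
    (count,) = state.get if state.exists else (0,)
    for pdf in batches:
        count += len(pdf)
    state.update((count,))
    yield pd.DataFrame({"current_page_title": [key[0]], "click_count": [count]})

@dlt.table(comment="Running click counts per page (stateful, illustrative)")
def page_click_counts():
    return (
        dlt.read_stream("clickstream_clean")
        .groupBy("current_page_title")
        .applyInPandasWithState(
            count_clicks,
            outputStructType="current_page_title STRING, click_count LONG",
            stateStructType="click_count LONG",
            outputMode="append",
            timeoutConf=GroupStateTimeout.NoTimeout,
        )
    )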

Bringing it all Together

The approach demonstrated above showcases how to build a DLT pipeline that efficiently transforms data while utilizing the new Sink API to seamlessly deliver the results to external Delta Tables and Kafka-enabled Event Hubs.

This feature is particularly valuable for real-time analytics pipelines, allowing data to be streamed into Kafka streams for applications like anomaly detection, predictive maintenance, and other time-sensitive use cases. It also enables event-driven architectures, where downstream processes can be triggered instantly by streaming events to Kafka topics, allowing fast processing of newly arrived data.

Call to Action

The DLT Sinks feature is now available in Public Preview for all Databricks customers! This powerful new capability lets you seamlessly extend your DLT pipelines to external systems like Kafka and Delta tables, ensuring real-time data flow and streamlined integrations. For more information, please refer to the following resources:

Appendix:

Pipeline Code:
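As a hedged reconstruction of the Bronze, Silver, and Gold tables described above (the dataset path and source column names curr_title, prev_title, and n follow the Databricks wikipedia clickstream sample and are assumptions; the sinks and flows from the sections above complete the pipeline):

import dlt
from pyspark.sql.functions import col

# Wikipedia clickstream sample packaged with Databricks (path is an assumption).
CLICKSTREAM_PATH = (
    "/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/"
)

# Bronze: ingest raw JSON with Auto Loader.
@dlt.table(comment="Raw wikipedia clickstream events")
def clickstream_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(CLICKSTREAM_PATH)
    )

# Silver: rename columns and enforce a basic quality expectation.
@dlt.table(comment="Cleaned clickstream events")
@dlt.expect_or_drop("valid_current_page", "current_page_title IS NOT NULL")
def clickstream_clean():
    return (
        dlt.read_stream("clickstream_raw")
        .select(
            col("curr_title").alias("current_page_title"),
            col("prev_title").alias("previous_page_title"),
            col("n").cast("int").alias("click_count"),
        )
    )

# Gold: keep only events whose current page is Apache_Spark; this table feeds the sinks.
@dlt.table(comment="Clickstream events referring to the Apache Spark page")
def spark_referrers():
    return (
        dlt.read_stream("clickstream_clean")
        .filter(col("current_page_title") == "Apache_Spark")
    )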
