DLT offers a robust platform for building reliable, maintainable, and testable data processing pipelines within Databricks. By leveraging its declarative framework and automatically provisioning optimal serverless compute, DLT simplifies the complexities of streaming, data transformation, and management, delivering scalability and efficiency for modern data workflows.
We’re excited to announce a much-anticipated enhancement: the ability to publish tables to multiple schemas and catalogs within a single DLT pipeline. This capability reduces operational complexity, lowers costs, and simplifies data management by allowing you to consolidate your medallion architecture (Bronze, Silver, Gold) into a single pipeline while maintaining organizational and governance best practices.
With this enhancement, you can publish tables to multiple catalogs and schemas from a single pipeline, and you no longer need the LIVE syntax to denote dependencies between tables. Fully and partially qualified table names are supported, along with USE SCHEMA and USE CATALOG commands, just like in standard SQL.

“The ability to publish to multiple catalogs and schemas from one DLT pipeline - and no longer requiring the LIVE keyword - has helped us standardize on pipeline best practices, streamline our development efforts, and facilitate the easy transition of teams from non-DLT workloads to DLT as part of our large-scale enterprise adoption of the tooling.”
— Ron DeFreitas, Principal Data Engineer, HealthVerity
All pipelines created from the UI now default to supporting multiple catalogs and schemas. You can set a default catalog and schema at the pipeline level through the UI, the API, or Databricks Asset Bundles (DABs).
If you are creating a pipeline programmatically, you can enable this capability by specifying the schema field in the PipelineSettings. This replaces the existing target field, ensuring that datasets can be published across multiple catalogs and schemas.
To create a pipeline with this capability via the API, you can follow a code sample like the one below (note: Personal Access Token authentication must be enabled for the workspace).
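The sketch here calls the Pipelines REST API directly with the requests library; the workspace URL, token, catalog, schema, and notebook path are placeholders rather than values from this announcement.

```python
import os
import requests

# Minimal sketch: create a DLT pipeline through the Pipelines REST API.
# DATABRICKS_HOST and DATABRICKS_TOKEN (a Personal Access Token) are assumed
# to be set in the environment; all names below are placeholders.
host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

payload = {
    "name": "multi-schema-demo-pipeline",
    "catalog": "main",      # default catalog for the pipeline
    "schema": "bronze",     # setting "schema" (instead of "target") enables
                            # publishing to multiple catalogs and schemas
    "continuous": False,
    "libraries": [
        {"notebook": {"path": "/Pipelines/multi_schema_demo"}}  # hypothetical notebook
    ],
}

response = requests.post(
    f"{host}/api/2.0/pipelines",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
response.raise_for_status()
print(response.json())  # contains the new pipeline_id
```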
By setting the schema field, the pipeline will automatically support publishing tables to multiple catalogs and schemas without requiring the LIVE keyword.
To use this capability with Databricks Asset Bundles (DABs), specify the schema field in the pipeline YAML and remove the target field if it exists (see the sketch below).
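As a rough illustration, a bundle definition for a pipeline using the schema field might look like the following; the bundle, resource, catalog, schema, and notebook names are all assumptions.

```yaml
# databricks.yml (excerpt) - hypothetical bundle with one DLT pipeline resource
bundle:
  name: multi_schema_demo

resources:
  pipelines:
    sales_pipeline:
      name: sales-pipeline
      catalog: main        # default catalog for the pipeline
      schema: bronze       # use "schema" here and remove any "target" field
      libraries:
        - notebook:
            path: ./pipelines/sales_pipeline.py
```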
Then run databricks bundle validate to validate that the DAB configuration is valid, and databricks bundle deploy -t <environment> to deploy your first DPM pipeline!

“The feature works just like we expect it to work! I was able to split up the different datasets within DLT into our stage, core and UDM schemas (basically a bronze, silver, gold setup) within one single pipeline.”
— Florian Duhme, Expert Data Software Developer, Arvato
Once your pipeline is set up, you can define tables using fully or partially qualified names in both SQL and Python.
SQL Example
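For instance, a table can be created under a fully qualified three-level name or a partially qualified one; the catalog, schema, and table names below are placeholders, and samples.nyctaxi.trips stands in for your source data.

```sql
-- Fully qualified target: publish to an explicit catalog and schema
CREATE OR REFRESH MATERIALIZED VIEW my_catalog.bronze.trips_raw AS
SELECT * FROM samples.nyctaxi.trips;

-- Partially qualified target: resolved against the pipeline's default catalog
CREATE OR REFRESH MATERIALIZED VIEW silver.trips_clean AS
SELECT * FROM my_catalog.bronze.trips_raw
WHERE trip_distance > 0;
```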
Python Example
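A roughly equivalent Python sketch, again with placeholder catalog, schema, and table names (spark is the session available in pipeline notebooks):

```python
import dlt

# Fully qualified target table: published to an explicit catalog and schema
@dlt.table(name="my_catalog.bronze.trips_raw")
def trips_raw():
    return spark.read.table("samples.nyctaxi.trips")

# Partially qualified target table: resolved against the pipeline's default catalog
@dlt.table(name="silver.trips_clean")
def trips_clean():
    return spark.read.table("my_catalog.bronze.trips_raw").where("trip_distance > 0")
```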
You can reference datasets using fully or partially qualified names, with the LIVE keyword being optional for backward compatibility.
SQL Example
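For example, an upstream pipeline dataset can be read by its qualified name alone (placeholder names again); prefixing it with LIVE would still work but is no longer needed.

```sql
-- Reference an upstream dataset without the LIVE keyword
CREATE OR REFRESH MATERIALIZED VIEW my_catalog.gold.trips_by_zone AS
SELECT pickup_zip, COUNT(*) AS trip_count
FROM silver.trips_clean
GROUP BY pickup_zip;
```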
Python Example
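A comparable Python sketch, reading the upstream dataset by its qualified name:

```python
import dlt

@dlt.table(name="my_catalog.gold.trips_by_zone")
def trips_by_zone():
    # The upstream dataset is referenced by its qualified name; no LIVE prefix needed
    return (
        spark.read.table("silver.trips_clean")
        .groupBy("pickup_zip")
        .count()
    )
```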
With this new capability, key API methods have been updated to support multiple catalogs and schemas more seamlessly.
Previously, these methods could only reference datasets defined within the current pipeline. Now, they can reference datasets across multiple catalogs and schemas, automatically tracking dependencies as needed. This makes it easier to build pipelines that integrate data from different locations without additional manual configuration.
In the past, these methods required explicit references to external datasets, making cross-catalog queries more cumbersome. With the new update, dependencies are now tracked automatically, and the LIVE schema is no longer required. This simplifies the process of reading data from multiple sources within a single pipeline.
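As an illustration of the kind of cross-catalog read this enables, the sketch below joins datasets that live in two different catalogs inside one pipeline; all names are placeholders, and spark.read.table is just one of the read paths this applies to.

```python
import dlt

@dlt.table(name="analytics_catalog.gold.orders_enriched")
def orders_enriched():
    # Both reads resolve across catalogs; dependencies on other pipeline
    # datasets are tracked automatically, with no LIVE schema required.
    orders = spark.read.table("sales_catalog.silver.orders")
    customers = spark.read.table("crm_catalog.silver.customers")
    return orders.join(customers, "customer_id")
```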
Databricks SQL syntax now supports setting active catalogs and schemas dynamically, making it easier to manage data across multiple locations.
SQL Example
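For instance, USE CATALOG and USE SCHEMA can set the active context for the statements that follow; catalog, schema, and table names are placeholders.

```sql
USE CATALOG my_catalog;
USE SCHEMA silver;

-- Unqualified names below resolve against the active catalog and schema
CREATE OR REFRESH MATERIALIZED VIEW trips_clean AS
SELECT * FROM my_catalog.bronze.trips_raw
WHERE trip_distance > 0;
```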
Python Example
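In Python, one way to get the same effect is to issue the equivalent statements through spark.sql; this is a sketch under that assumption, with placeholder names.

```python
import dlt

# Set the active catalog and schema for subsequently defined datasets
spark.sql("USE CATALOG my_catalog")
spark.sql("USE SCHEMA silver")

@dlt.table(name="trips_clean")  # resolved against the active catalog and schema
def trips_clean():
    return spark.read.table("my_catalog.bronze.trips_raw").where("trip_distance > 0")
```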
This feature also allows pipeline owners to publish the event log to the Unity Catalog metastore for improved observability. To enable this, specify the event_log field in the pipeline settings JSON.
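As an example, the settings fragment might look like the following, assuming the event_log specification takes a target catalog, schema, and table name (all names here are placeholders):

```json
{
  "name": "multi-schema-demo-pipeline",
  "catalog": "main",
  "schema": "bronze",
  "event_log": {
    "catalog": "main",
    "schema": "ops",
    "name": "demo_pipeline_event_log"
  }
}
```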
With that, you can now issue GRANT statements on the event log table just like on any regular table.
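For example, using the placeholder names from the settings above and a hypothetical data-engineers group:

```sql
GRANT SELECT ON TABLE main.ops.demo_pipeline_event_log TO `data-engineers`;
```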
You can also create a view over the event log table:
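(The sketch below assumes standard event log columns such as timestamp, level, message, and details; table and view names are placeholders.)

```sql
CREATE VIEW main.ops.pipeline_errors AS
SELECT timestamp, level, message, details
FROM main.ops.demo_pipeline_event_log
WHERE level = 'ERROR';
```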
Besides all of the above, you are also able to stream from the event log table:
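(A minimal Python sketch; it assumes a Databricks notebook where spark is predefined, and the table names and checkpoint path are placeholders.)

```python
# Incrementally process new event log records as they arrive
events = spark.readStream.table("main.ops.demo_pipeline_event_log")

(events.filter("level = 'ERROR'")
       .writeStream
       .option("checkpointLocation", "/Volumes/main/ops/checkpoints/event_log_errors")
       .toTable("main.ops.event_log_errors"))
```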
Looking ahead, these enhancements will become the default for all newly created pipelines, whether created via UI, API, or Databricks Asset Bundles. Additionally, a migration tool will soon be available to help transition existing pipelines to the new publishing model.
Read more in the documentation here.