Today, we are excited to announce the public preview of data lineage in Unity Catalog, available on AWS and Azure.
In the previous announcement blog, we discussed how teams can use data lineage in a lakehouse as a powerful tool for effective data governance. In this post, we explore some of the key features in this release, show how to get started capturing data lineage with Unity Catalog, and offer a sneak peek at our roadmap for lineage.
Unity Catalog, now generally available on AWS and Azure, provides a unified governance solution for all data and AI assets in your lakehouse on any cloud. With automated data lineage in Unity Catalog, data teams can now track sensitive data for compliance requirements, ensure data quality, and perform impact analysis of any data changes across the lakehouse. Lineage is aggregated across all workspaces attached to a Unity Catalog metastore. This means that lineage captured in one workspace is visible in any other workspace sharing that metastore.
Key data lineage features available with the public preview
Lineage for all workloads in any language: Unity Catalog automatically tracks data lineage across queries executed in any language (Python, SQL, R, and Scala) and execution mode (batch and streaming). The lineage graphs are displayed in real time with just a few clicks.
Lineage for notebooks, workflows, and dashboards: Unity Catalog also captures lineage for notebooks, workflows, and dashboards. This helps with end-to-end visibility into how data is used in your organization and understanding the impact of any data changes on downstream consumers.
Built-in security: Lineage graphs use the common permission model in Unity Catalog. Users must have the correct permissions to view lineage data, which adds a layer of security and reduces the risk of data breaches. Users who do not have the SELECT privilege on a table cannot explore the lineage associated with that table. Additionally, users can see lineage information only for the notebooks, workflows, and dashboards they have permission to view.
Column-level granularity: Unity Catalog captures data lineage for tables, views, and columns. This gives data teams a granular view of how data flows both upstream and downstream from a particular table or column in the lakehouse.
Easily exportable via REST API: Lineage information can be retrieved via REST API to support integrations with other data catalogs and governance solutions.
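To make the REST export concrete, here is a minimal Python sketch of how a table lineage request might be assembled and how the result might be read. It assumes the table lineage endpoint is `/api/2.0/lineage-tracking/table-lineage` and that it accepts a JSON body with `table_name` and `include_entity_lineage`; check the data lineage guide for your cloud before relying on either. The helper names, the workspace URL, the token, the table names, and the response shape used in `linked_tables` are illustrative assumptions, not part of this announcement.

```python
import json
import urllib.request


def build_table_lineage_request(host, token, table_name, include_entity_lineage=True):
    """Build (but do not send) a table lineage request.

    Assumption: the endpoint path and the JSON field names below match
    the Data Lineage REST API in your workspace.
    """
    url = f"{host.rstrip('/')}/api/2.0/lineage-tracking/table-lineage"
    payload = {
        "table_name": table_name,  # full name: catalog.schema.table
        "include_entity_lineage": include_entity_lineage,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="GET",
    )


def linked_tables(lineage, direction):
    """Return full table names found under 'upstreams' or 'downstreams'.

    Assumption: each entry may carry a 'tableInfo' object with
    'catalog_name', 'schema_name', and 'name' keys.
    """
    names = []
    for entry in lineage.get(direction, []):
        info = entry.get("tableInfo")
        if info:
            names.append(f"{info['catalog_name']}.{info['schema_name']}.{info['name']}")
    return names
```

To send the request, pass the returned object to `urllib.request.urlopen`, decode the body with `json.load`, and hand the resulting dictionary to `linked_tables`. Because lineage respects Unity Catalog permissions, the identity behind the token would be expected to see only the lineage it is allowed to view.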
Getting started with data lineage in Unity Catalog
Watch the demo below to learn more about the data lineage capabilities in Unity Catalog.
Data lineage is available with Databricks Premium and Enterprise tiers at no additional cost. If you are already a Databricks customer, follow the data lineage guides (AWS | Azure) to get started. If you are not yet a Databricks customer, sign up for a free trial with a Premium or Enterprise workspace.
What’s coming next
This is just the beginning. We are working on exciting new features to realize our vision for seamless data observability and data quality with data lineage in a lakehouse.
Lineage for files: Trace lineage back to files in cloud storage - especially useful for first-mile ETL use cases.
In-context lineage: View and action lineage where it is most relevant - for example, view lineage for a specific workflow to quickly understand the impact of failures.
Lineage as system tables: Programmatically access predefined system tables to query lineage data using your favorite language.