What’s New with Databricks Unity Catalog at Data + AI Summit 2024
In an era marked by rapid advancements in artificial intelligence and an explosion of data and Gen AI tools, enterprises face fragmented data and AI governance, impeding their efforts to democratize data and AI. To thrive in this era, enterprises must adopt an open and unified approach to data and AI governance. This entails:
- Open Connectivity: Creating a single, reliable source of truth for all their data, regardless of its origin or format.
- Unified Governance: Implementing comprehensive oversight so that all data (files, tables) and AI assets (ML models, AI tools, notebooks) are discovered, secured, monitored, and tracked in a central system.
- Open Accessibility: Providing the flexibility to access data and AI resources from any tool, compute engine, or platform using open standards and interfaces to avoid lock-in.
This unified and open approach to governance is fundamental to building a robust Data Intelligence Platform. Three years ago, Databricks pioneered this approach by releasing Unity Catalog, the industry's only unified governance solution for data and AI across clouds, data formats, and data platforms. It is designed to scale securely and compliantly for both BI and Gen AI use cases. More than 10,000 enterprises now use Unity Catalog to govern their data and AI estate.
We are excited to announce cutting-edge advancements to further enhance these capabilities across Open Accessibility, Open Connectivity, and Unified Governance.
Open Accessibility - Access data and AI resources from any compute engine, tool or platform
Open sourcing Unity Catalog: The industry's only universal catalog for data and AI
We are excited to announce that we are open-sourcing Unity Catalog. This initiative underscores Databricks’ commitment to an open ecosystem, providing customers with the flexibility and control they need without being tied to a single vendor. This is a joint effort with Amazon Web Services, Microsoft Azure, Google Cloud, Nvidia, Salesforce, DuckDB, LangChain, dbt Labs, Fivetran, Confluent, Unstructured, Onehouse, Immuta, Informatica and many more.
Today, we are releasing version 0.1 of open source Unity Catalog. While some of our APIs and features will still be evolving, this release showcases several important capabilities of Unity Catalog:
- Tables, Volumes (unstructured data), and AI Tools/Functions can be managed together.
- Tables can be in multiple formats, including Delta Lake, Iceberg via UniForm, Parquet, CSV, and JSON.
- Unity Catalog implements the Iceberg REST Catalog API for access from the Iceberg engine ecosystem, leveraging expertise from Tabular.
- The API supports credential vending to gate clients’ access to the underlying cloud storage for tables and volumes, centralizing governance in the catalog server.
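To make the credential vending idea in the last bullet concrete, here is a minimal, self-contained sketch of the pattern. It is not the Unity Catalog API: the class names, method names, table name, principal names, and storage path are all hypothetical, and the token is a random string standing in for a real short-lived cloud credential. The point it illustrates is that the catalog checks permissions first and only then hands out a credential scoped to one storage location.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporaryCredential:
    token: str          # opaque secret; a stand-in for a real cloud storage token
    storage_path: str   # the only storage prefix this credential is meant to cover
    expires_at: float   # Unix timestamp after which the credential is invalid


class CatalogServer:
    """Hypothetical toy catalog that vends credentials after a permission check."""

    def __init__(self, ttl_seconds=900):
        self.ttl_seconds = ttl_seconds
        self._locations = {}   # full table name -> storage path
        self._grants = set()   # (principal, full table name) pairs allowed to read

    def register_table(self, full_name, storage_path):
        self._locations[full_name] = storage_path

    def grant_select(self, principal, full_name):
        self._grants.add((principal, full_name))

    def vend_credential(self, principal, full_name, now=None):
        # Governance lives here: no grant in the catalog, no storage access.
        if full_name not in self._locations:
            raise KeyError(f"unknown table: {full_name}")
        if (principal, full_name) not in self._grants:
            raise PermissionError(f"{principal} may not read {full_name}")
        issued_at = time.time() if now is None else now
        return TemporaryCredential(
            token=secrets.token_hex(16),
            storage_path=self._locations[full_name],
            expires_at=issued_at + self.ttl_seconds,
        )


catalog = CatalogServer(ttl_seconds=900)
catalog.register_table("main.sales.orders", "s3://example-bucket/sales/orders")
catalog.grant_select("analyst@example.com", "main.sales.orders")

# The client never holds long-lived storage keys; it asks the catalog each time.
cred = catalog.vend_credential("analyst@example.com", "main.sales.orders", now=1000.0)
```

Because every read goes through `vend_credential`, revoking a grant in the catalog is enough to cut off future access, which is the sense in which governance is centralized in the catalog server.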
If you are already a Databricks customer, there is nothing you need to do differently. Customers’ existing Unity Catalog deployments implement the same open APIs – enabling external clients to read from all tables (including managed and external tables), volumes, and functions in hosted Unity Catalog from Day 1, with your existing access controls in place. This change simply means a larger ecosystem of clients will work with your existing catalog.
Unity REST APIs enable our partners and the open source community to build powerful integrations that will enable customers to work on their tables, unstructured data, and AI tools/functions from diverse applications, with no external access fees.
Join the Unity Catalog OSS community at unitycatalog.io and start developing with Unity Catalog by visiting our GitHub repository.
“AT&T is committed to making our data interoperable with our platforms. With the announcement of Unity Catalog's open sourcing, we are encouraged by Databricks' step to make lakehouse governance and metadata management possible through open standards. The flexibility to utilize interoperable tools with our data and AI assets, with consistent governance, is core to the AT&T data platform strategy.”— Matt Dugan, VP Data Platforms, AT&T
“AWS welcomes Databricks’ move to open source Unity Catalog. AWS is committed to working with the industry on open source solutions that enable choice and interoperability for customers.”— Chris Grusz, Managing Director of Technology Partnerships, AWS
Unified Governance - Across Data and AI
Lakehouse Monitoring: Profiling, diagnosing, and enforcing data quality with intelligence
We are also excited to announce the General Availability of Databricks Lakehouse Monitoring, available on AWS and Azure. Our unified approach to monitoring data and AI allows you to easily profile, diagnose, and enforce quality directly in the Databricks Data Intelligence Platform.
Lakehouse Monitoring simplifies the process for data teams by providing automated profiling and a dashboard that visualizes trends and anomalies over time, without requiring any additional tools or added complexity. By tracking key metrics such as data volume, percent nulls, numerical distribution changes, and categorical distribution over time, Lakehouse Monitoring provides insights and identifies problematic columns early on. For inference tables, you can monitor model drift and performance metrics like accuracy, F1 score, precision, and recall to determine when retraining is needed. With a proactive approach to quality, teams can discover issues before business operations are impacted.
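For readers who want a feel for the kinds of metrics named above, the following plain-Python sketch computes a percent-nulls figure, a simple categorical drift score, and the four classification metrics. It is an illustration only and does not reflect how Lakehouse Monitoring computes its metrics: the function names are hypothetical, and total variation distance is just one assumed choice of drift measure.

```python
from collections import Counter


def percent_nulls(values):
    """Share of entries that are None, as a percentage."""
    if not values:
        return 0.0
    return 100.0 * sum(v is None for v in values) / len(values)


def categorical_drift(baseline, current):
    """Total variation distance between two non-empty categorical samples.

    0.0 means identical category frequencies; 1.0 means no overlap at all.
    """
    base_counts, cur_counts = Counter(baseline), Counter(current)
    categories = set(base_counts) | set(cur_counts)
    return 0.5 * sum(
        abs(base_counts[c] / len(baseline) - cur_counts[c] / len(current))
        for c in categories
    )


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


nulls = percent_nulls([1, None, 3, None])
drift = categorical_drift(["a", "a", "b", "b"], ["a", "b", "b", "b"])
scores = classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

Tracked per column and per time window, numbers like these are what let a dashboard show a trend line and flag the window where a column or a model starts to degrade.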
“Lakehouse Monitoring has been a game changer. It helps us solve the issue of data quality directly in the platform. It's like the heartbeat of the system. Our data scientists are excited they can finally understand data quality without having to jump through hoops.” - Yannis Katsanos, Director of Data Science, Ecolab
Attribute-Based Access Controls - Scalable access management for data and AI
We are pleased to announce the Private Preview of Attribute-Based Access Control (ABAC) in Unity Catalog. ABAC offers organizations a high-leverage governance solution that simplifies the enforcement of governance policies across their entire lakehouse. By employing straightforward rules and tags, ABAC ensures consistent governance across all data sources, whether native to Databricks or federated from external sources. Access policies can be defined and managed through the policy builder UI, SQL queries, or APIs. Moreover, Databricks ABAC integrates with third-party governance tools, allowing organizations to leverage existing investments in governance infrastructure.
With ABAC, users can establish access controls tailored to specific attributes of resources like workspaces, data assets such as tables, and AI assets. These attributes encompass a wide range of parameters, including user-defined tags, workspace details, location, identity, and time. Whether it's ensuring sensitive data remains restricted to authorized personnel or dynamically adjusting access based on changing project requirements, ABAC empowers users to enforce security measures with granular precision.
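The general idea of tag-driven access rules can be sketched in a few lines of Python. This is a conceptual illustration only, not the Unity Catalog ABAC syntax or API, which this post does not show: the `Asset`, `User`, and `TagPolicy` types, the `pii` tag, and the group names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    tags: frozenset = frozenset()


@dataclass(frozen=True)
class User:
    name: str
    groups: frozenset = frozenset()


@dataclass(frozen=True)
class TagPolicy:
    """Assets carrying `tag` are readable only by members of `required_group`."""
    tag: str
    required_group: str

    def allows(self, user, asset):
        if self.tag not in asset.tags:
            return True  # the policy does not apply to this asset
        return self.required_group in user.groups


def can_read(user, asset, policies):
    """A read is allowed only if every applicable policy allows it."""
    return all(policy.allows(user, asset) for policy in policies)


policies = [TagPolicy(tag="pii", required_group="privacy_officers")]

customers = Asset("main.crm.customers", tags=frozenset({"pii"}))
weather = Asset("main.public.weather")

alice = User("alice", groups=frozenset({"privacy_officers"}))
bob = User("bob", groups=frozenset({"analysts"}))

alice_reads_customers = can_read(alice, customers, policies)
bob_reads_customers = can_read(bob, customers, policies)
bob_reads_weather = can_read(bob, weather, policies)
```

The design point is that the single `pii` rule covers every asset that carries the tag, now or in the future, so nobody has to maintain a separate grant per table.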
Announcing Unity Catalog Metrics - Governed business metrics for data and AI
We are also introducing Unity Catalog Metrics, enabling data teams to make better business decisions using certified metrics, defined in the lakehouse and accessible via Databricks (e.g., SQL, Notebooks, AI/BI Dashboards, and AI/BI Genie spaces) and third-party BI tools (e.g., Tableau, Power BI).
Data is often spread across multiple systems and departments, leading to varying definitions of key business metrics among different teams. This inconsistency can cause confusion and misaligned reporting. By standardizing metric definitions, Unity Catalog Metrics allows data teams to work with the same semantics and underlying data, ensuring that all teams use consistent definitions. This promotes trust and reliability in the data.
Unity Catalog Metrics is built on top of your existing lakehouse resources, such as tables and files, and acts as an intermediary between your data sources and data consumers. This new Unity Catalog asset is fully governed and discoverable in Unity Catalog like any other resource and provides complete lineage visibility. With an open approach, users can access these metrics from all Databricks interfaces, including AI/BI Dashboards, AI/BI Genie, Databricks SQL, data science and machine learning tools like notebooks, and any third-party BI tools such as Power BI, Tableau, Looker and more. These metrics are fully SQL-addressable and support integration with third-party metrics tools such as dbt Labs, Cube, and AtScale, ensuring seamless integration and comprehensive data analysis capabilities.
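As a rough illustration of why a single governed definition helps, the sketch below registers one metric and lets two different consumers evaluate it. It is not the Unity Catalog Metrics interface, which this post does not describe in detail: the `MetricDefinition` and `MetricRegistry` names, the `net_revenue` metric, and the sample rows are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    description: str
    compute: Callable[[Iterable[dict]], float]
    certified: bool = False


class MetricRegistry:
    """Hypothetical central store: each metric name has exactly one definition."""

    def __init__(self):
        self._metrics = {}

    def register(self, metric):
        if metric.name in self._metrics:
            raise ValueError(f"metric already defined: {metric.name}")
        self._metrics[metric.name] = metric

    def evaluate(self, name, rows):
        return self._metrics[name].compute(rows)


def net_revenue(rows):
    # One agreed rule: only completed orders count, and refunds are subtracted.
    return sum(r["amount"] - r["refund"] for r in rows if r["status"] == "completed")


registry = MetricRegistry()
registry.register(
    MetricDefinition(
        name="net_revenue",
        description="Completed order amounts minus refunds",
        compute=net_revenue,
        certified=True,
    )
)

orders = [
    {"amount": 100.0, "refund": 0.0, "status": "completed"},
    {"amount": 50.0, "refund": 10.0, "status": "completed"},
    {"amount": 75.0, "refund": 0.0, "status": "cancelled"},
]

# Two consumers reference the metric by name instead of re-implementing it.
dashboard_value = registry.evaluate("net_revenue", orders)
notebook_value = registry.evaluate("net_revenue", orders)
```

Because both consumers resolve `net_revenue` through the registry, they cannot drift apart on whether cancelled orders or refunds are included, which is the inconsistency described above.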
Keep an eye out for more updates on this capability in Unity Catalog!
Open Connectivity - Any data, any format, any source
Lakehouse Federation: Discover, query, and govern any data, no matter where it lives
We’re excited to announce that Lakehouse Federation in Unity Catalog will soon be generally available. Lakehouse Federation offers a unified data management, discovery, and governance experience across multiple platforms, including MySQL, PostgreSQL, Amazon Redshift, Snowflake, Azure SQL Database, Azure Synapse, Google BigQuery, and more, all within Databricks. Unity Catalog extends its advanced security features, like row and column level access controls, and discovery tools, such as tags and data lineage, to these external data sources, ensuring consistent governance practices.
The upcoming General Availability release will include connector support for MySQL, PostgreSQL, Amazon Redshift, Snowflake, Azure SQL Database, Azure Synapse, and Google BigQuery (Preview). It will also enhance pushdown coverage and performance for Snowflake, SQL Server, PostgreSQL, Redshift, and Synapse, with OAuth support for Snowflake connections and Azure AD support for Azure ecosystem connections. Additionally, the release will offer case-sensitive namespace support and introduce a Salesforce Data Cloud Connector (Preview).
We’re also extending Lakehouse Federation to Apache Hive and AWS Glue, with a preview coming soon.
“Lakehouse Federation allows us to bring other data sources into Unity Catalog much quicker as we transition to the target architecture.”— Bryce Bartmann, Chief Digital Technology Advisor, Shell
Getting started with Unity Catalog
By embracing Unity Catalog as the cornerstone of your Lakehouse architecture, you can unlock the power of a flexible and scalable governance implementation that spans your entire data and AI estate. To get started, follow the Unity Catalog guides available for AWS, Azure, and GCP.
Watch the Data + AI Summit 2024 keynote from Matei Zaharia, Co-founder and Chief Technology Officer at Databricks, to learn more about these recent announcements. Register for Data + AI Summit and explore the top data and AI governance sessions.
Download the free eBook on how to build an effective governance strategy for data and AI.