What’s new with Unity Catalog at Data and AI Summit 2023
The fundamental principles of governance – accountability, compliance, quality, and transparency – that are essential for data management have now become equally imperative for AI. Databricks took a pioneering approach with Unity Catalog by releasing the industry's only unified solution for data and AI governance across clouds and data platforms.
Organizations can use Unity Catalog to securely discover, access, monitor and collaborate on files, tables, ML models, notebooks and dashboards across any data platform or cloud, while also leveraging AI to boost productivity and unlock the full potential of the lakehouse environment.
We are excited to announce new advancements in Unity Catalog, including Lakehouse Federation, Governance for AI, AI-powered governance (Lakehouse Monitoring and Lakehouse Observability), and more.
Lakehouse Federation: Discover, govern and query your data wherever it lives
Lakehouse Federation in Unity Catalog enables organizations to build an open, performant, and secure data mesh architecture. With Lakehouse Federation, organizations can leverage a consistent data management, discovery, and governance experience for all their data across various platforms, including MySQL, PostgreSQL, Amazon Redshift, Snowflake, Azure SQL Database, Azure Synapse, Google BigQuery, and more, all within Databricks. Additionally, Unity Catalog's advanced security features, such as row and column level access controls, along with discovery features like tags and data lineage, are extended to these external data sources, ensuring consistent governance practices.
Governance for AI - Unifying data and AI catalogs under one roof
We are also expanding the governance model within Unity Catalog to provide comprehensive management of both AI assets and data in a unified experience. This consolidation simplifies DataOps and MLOps processes, and prepares organizations for AI compliance, by bringing together all the necessary capabilities in one centralized location. Key enhancements include:
Feature Store and Model Registry in Unity Catalog
We announced the public preview of Model Registry in Unity Catalog, with the public preview of Feature Store coming later in July. With this capability, Unity Catalog is the only governance solution that brings together all data and ML assets, from data and features to models, in one catalog, ensuring full visibility and fine-grained access controls throughout the AI workflow. This unified approach provides automatic versioning and lineage tracking, centralized governance, and seamless cross-workspace collaboration for simplified MLOps and enhanced productivity. Additionally, advanced monitoring capabilities improve visibility, quality, understanding, and control across your entire AI workflow.
Volumes in Unity Catalog: Govern any non-tabular data
There are many use cases, particularly for machine learning and data science workloads, which require access to non-tabular data, such as image, audio, video, or PDF files.
We announced Volumes in Unity Catalog. Volumes is a new type of object that catalogs collections of files and helps you build scalable file-based applications that read and process large collections of data irrespective of its format, including unstructured, semi-structured, and structured. This enables you to manage, govern and track lineage for non-tabular data along with the tabular data in Unity Catalog. Stay tuned for the public preview of Volumes, coming in the next few weeks!
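As a rough illustration of how files in a volume might be addressed, the sketch below builds a path of the form /Volumes/<catalog>/<schema>/<volume>/<file>. That layout is an assumption taken from the Databricks documentation for Volumes rather than from this announcement, and the helper name and the catalog, schema, volume, and file names are all hypothetical.

```python
from pathlib import PurePosixPath


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a file path under a Unity Catalog volume.

    Assumes the /Volumes/<catalog>/<schema>/<volume>/... layout; this helper
    is illustrative and is not part of any Databricks library.
    """
    for name in (catalog, schema, volume):
        # Each namespace level must be a single, non-empty path component.
        if not name or "/" in name:
            raise ValueError(f"invalid name component: {name!r}")
    return str(PurePosixPath("/Volumes", catalog, schema, volume, *parts))


# Hypothetical names, used only to show the resulting shape of the path.
example = volume_path("main", "ml", "raw_images", "cats", "0001.png")
print(example)
```

Because the result is an ordinary path string, file-based code can treat governed files much like any other files, while access is still mediated by Unity Catalog permissions.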
AI for governance: Lakehouse Monitoring and Lakehouse Observability
Unity Catalog not only offers robust governance capabilities for AI but also harnesses the power of AI to optimize governance workflows. Key enhancements include:
Lakehouse Monitoring: Monitor the quality of your organization's data and AI assets
Ensuring trust in data and AI models is paramount for the success of any organization. To address this critical requirement, we have introduced Databricks Lakehouse Monitoring, an AI-driven monitoring service that encompasses the entire data pipeline, including data, ML models, and features.
Databricks Lakehouse Monitoring provides proactive alerts for quality issues and errors in data and ML model pipelines, including the automatic classification and identification of personally identifiable information (PII) using AI-based data classification technology from Okera, our recent acquisition. Additionally, data teams can effortlessly share comprehensive data and ML quality reports with stakeholders through auto-generated dashboards.
Finally, data teams can effectively debug and perform impact assessment of any issues identified in the monitoring reports by utilizing Unity Catalog's real-time data lineage, down to the column level. This streamlines monitoring and diagnostics workflows, providing a comprehensive end-to-end solution.
Lakehouse Observability: System tables and dashboards for all aspects of the lakehouse
Observability is a critical aspect of any data and AI workload. To address this requirement, we announced the public preview of System Tables for auditing, lineage and billing in Unity Catalog, with additional tables coming later this year.
System Tables serve as a centralized analytical store and provide comprehensive cost and usage analytics, offering valuable insights into resource consumption and expenditure. Additionally, System Tables allow users to perform audit analytics for jobs, notebooks, clusters, and SQL/ML endpoints, and to track data lineage and access permissions. With the ability to easily query System Tables in Unity Catalog using any language, users can build customized dashboards and notebooks, and leverage the power of AI to transform operational data into actionable business insights. Finally, users can further operationalize this intelligence with DBSQL alerts to systematically drive ROI improvements across their end-to-end intelligent data application lifecycle.
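To make the cost-analytics idea concrete, here is a minimal sketch of the kind of aggregation one might run over a billing usage table. It uses an in-memory SQLite database as a stand-in so it can run anywhere; the table name, the column names (usage_date, sku_name, usage_quantity), the SKU labels, and the sample rows are all assumptions made up for illustration, not real System Tables output.

```python
import sqlite3


def usage_by_sku(conn: sqlite3.Connection) -> list:
    """Return total usage per SKU, largest first."""
    query = """
        SELECT sku_name, SUM(usage_quantity) AS total_usage
        FROM usage
        GROUP BY sku_name
        ORDER BY total_usage DESC
    """
    return conn.execute(query).fetchall()


# Stand-in for a billing usage table; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage (usage_date TEXT, sku_name TEXT, usage_quantity REAL)"
)
conn.executemany(
    "INSERT INTO usage VALUES (?, ?, ?)",
    [
        ("2023-06-01", "JOBS_COMPUTE", 10.0),
        ("2023-06-01", "SQL_COMPUTE", 4.5),
        ("2023-06-02", "JOBS_COMPUTE", 7.5),
    ],
)

for sku, total in usage_by_sku(conn):
    print(f"{sku}: {total}")
```

On Databricks, a similar GROUP BY query would presumably be issued against the actual system tables from a notebook or SQL editor, and its result could feed a dashboard or an alert.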
Additional advancements in governance on the Lakehouse
Row and Column-level data security
To enhance data security effectively at the granular level, Unity Catalog provides row filtering and column masking. Users can leverage standard SQL functions to define row filters and column masks, enabling fine-grained access controls at the level of individual rows and columns. This functionality is in public preview on AWS, Azure, and GCP.
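The semantics of row filters and column masks can be sketched in plain Python. This is only a simulation of the idea, not Unity Catalog's API: in Unity Catalog the filter and mask are SQL functions attached to a table, whereas here the function names, group names, and sample rows are hypothetical.

```python
from typing import Callable, Dict, List, Set

Row = Dict[str, str]


def region_filter(groups: Set[str]) -> Callable[[Row], bool]:
    """Row filter: admins see every row, everyone else sees only US rows."""
    def keep(row: Row) -> bool:
        return "admins" in groups or row["region"] == "US"
    return keep


def ssn_mask(groups: Set[str]) -> Callable[[str], str]:
    """Column mask: only the hr group sees the full value."""
    def mask(value: str) -> str:
        return value if "hr" in groups else "***-**-" + value[-4:]
    return mask


def read_table(rows: List[Row], groups: Set[str]) -> List[Row]:
    """Apply the row filter first, then mask the ssn column of surviving rows."""
    keep = region_filter(groups)
    mask = ssn_mask(groups)
    return [{**row, "ssn": mask(row["ssn"])} for row in rows if keep(row)]


# Hypothetical sample data.
rows = [
    {"name": "Ana", "region": "US", "ssn": "123-45-6789"},
    {"name": "Ben", "region": "EU", "ssn": "987-65-4321"},
]

print(read_table(rows, {"analysts"}))
print(read_table(rows, {"admins", "hr"}))
```

The point of the sketch is that the same query returns different results depending on who runs it: the policy lives with the table rather than in each consumer's code.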
Tags for data classification
Unity Catalog goes beyond just discovery and provides contextual insights about the data, enabling users to jumpstart their work and accelerate analytics and AI initiatives. Users can easily describe and tag data assets to improve understanding, gain insights into the popularity of an asset, and identify domain experts and frequently used notebooks, queries, and joins, making data enrichment a breeze.
LakehouseIQ: The AI-powered engine that uniquely understands your business
We also announced LakehouseIQ, a knowledge engine that learns the unique nuances of your business and the complex layers of your data, enabling seamless natural language access to the right data at the right time. LakehouseIQ is powered by Unity Catalog, which provides the metadata and lineage leveraged by the AI while ensuring the organization's internal security and governance policies are consistently enforced.
Getting Started with Databricks Unity Catalog
By embracing Unity Catalog as the cornerstone of your Lakehouse architecture, you can unlock the power of a flexible and scalable governance implementation that spans your entire data and AI estate. To get started, follow the Unity Catalog guides available for AWS, Azure, and GCP.
Watch the Data + AI Summit 2023 keynote from Matei Zaharia, co-founder and Chief Technology Officer at Databricks, to learn more. Register for Data + AI Summit and explore the top data and AI governance sessions.