Data lake systems such as S3, ADLS, and GCS store the majority of data in today’s enterprises thanks to their scalability, low cost, and open interfaces. Over time, these systems have also become an attractive place to process data thanks to lakehouse technologies such as Delta Lake that enable ACID transactions and fast queries. However, one area where data lakes have remained harder to manage than traditional databases is governance; so far, these systems have only offered tools to manage permissions at the file level (e.g. S3 and ADLS ACLs), using cloud-specific concepts like IAM roles that are unfamiliar to most data professionals.
That’s why we’re thrilled to announce Unity Catalog, which brings fine-grained governance and security to lakehouse data using a familiar, open interface. Unity Catalog lets organizations manage fine-grained data permissions using standard ANSI SQL or a simple UI, enabling them to safely open their lakehouse for broad internal consumption. It works uniformly across clouds and data types. Finally, it goes beyond managing tables to govern other types of data assets, such as ML models and files. Thus, enterprises get a simple way to govern all their data and AI assets.
What’s hard with data lake governance tools today?
Although all cloud storage systems (e.g. S3, ADLS and GCS) offer security controls today, these tools are file-oriented and cloud-specific, both of which cause problems as organizations scale up. We’ve often seen customers run into four problems:
- Lack of fine-grained (row, column and view level) security: Cloud data lakes can generally only set permissions at the file or directory level, making it hard to share just a subset of a table with particular users. This makes it tedious to onboard enterprise users who should not have access to the whole table.
- Governance tied to physical data layout: Because governance controls are at the file level, data teams must carefully structure their data layout to support the desired policies. For example, a team might partition data into different directories by country and give access to each directory to different groups. But what should the team do when governance rules change? If different states inside one country adopt different data regulations, the organization may need to restructure all its data.
- Nonstandard, cloud-specific interfaces: Cloud governance APIs such as IAM are unfamiliar to data professionals (e.g., database administrators) and differ across clouds. Today, enterprises increasingly have to store data in multiple clouds (e.g., to satisfy privacy regulations), so they need to be able to manage data across clouds.
- No support for other asset types: Data lake governance APIs work for files in the lake, but modern enterprise workflows produce a wide range of other types of data assets. For example, SQL workflows often revolve around views, data science workloads produce ML models, and many workloads connect to data sources other than the lake (e.g., databases). In the modern compliance landscape, all of these assets need to be governed the same way if they contain sensitive data. Thus, data teams have to reimplement the same security policies in many different systems.
Unity Catalog’s approach
Unity Catalog solves these problems by implementing a fine-grained approach to data governance based on open standards that works across data asset types and clouds. It is designed around four key principles:
- Fine-grained permissions: Unity Catalog can enforce permissions for data at the row, column or view level instead of the file level, so that you can always share just part of your data with a new user without copying it.
- An open, standard interface: Unity Catalog’s permission model is based on ANSI SQL, making it instantly familiar to any database professional. We’ve also built a UI to make governance easy for data stewards, and we’ve extended the SQL model to support attribute-based access control, allowing you to tag many objects with the same attribute (e.g., “PII data”) and apply one policy to all of them. Finally, the same SQL-based interface can be used to manage ML models and external data sources.
- Central control: Unity Catalog can work across multiple Databricks workspaces, geographic regions and clouds, allowing you to manage all enterprise data centrally. This central position also enables it to track lineage and audit all accesses.
- Secure access from any platform: Although we love the Databricks platform, we know that many customers will also access the data from other platforms and that they’d like their governance rules to work across them. Unity Catalog enforces security permissions from any client that connects through JDBC/ODBC or through Delta Sharing, the open protocol we’ve launched to exchange large datasets between a wide range of platforms.
Let’s look at how the Unity Catalog can be used to implement common governance tasks.
Easily manage permissions using ANSI SQL
Unity Catalog brings fine-grained centralized governance to all data assets across clouds through the open standard ANSI SQL Data Control Language (DCL). This means administrators can easily grant permission to arbitrary user-specific subsets of the data using familiar SQL — no need to learn an arcane, cloud-specific interface. We’ve also added a powerful tagging feature that lets you control access to multiple data items at once based on attributes to further simplify governance at scale.
Below are a few examples of how you can use SQL grant statements with the Unity Catalog to add permissions to existing data stored on your data lake.
First, you can create tables in the catalog either from scratch or by pointing to existing data in a cloud storage system, such as S3, accessed with cloud-specific credentials:
CREATE EXTERNAL TABLE iot_events LOCATION s3:/... WITH CREDENTIAL iot_iam_role
You can now simply use standard SQL GRANT statements to set permissions, as in any database. Below is an example of how to grant access to all of iot_events to an entire group such as engineers, or to just the date and country columns to the marketing group:
GRANT SELECT ON iot_events TO engineers;
GRANT SELECT(date, country) ON iot_events TO marketing;
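To make the intent of these two grants concrete, here is a minimal Python sketch of how a column-level permission check could behave. It is an illustrative model only, not how Unity Catalog is implemented: the `Grant` class, the `allowed_columns` function, and the extra `device_id` and `email` columns in the schema are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed schema for iot_events, for illustration only.
TABLE_COLUMNS = {"iot_events": ["date", "country", "device_id", "email"]}


@dataclass(frozen=True)
class Grant:
    """Hypothetical record of one SELECT grant."""
    principal: str
    table: str
    columns: Optional[frozenset] = None  # None means the whole table


# Mirrors the two GRANT statements above.
GRANTS = [
    Grant("engineers", "iot_events"),
    Grant("marketing", "iot_events", frozenset({"date", "country"})),
]


def allowed_columns(principal, table):
    """Return the columns of `table` that `principal` may SELECT, in schema order."""
    allowed = set()
    for grant in GRANTS:
        if grant.principal == principal and grant.table == table:
            if grant.columns is None:
                # A table-level grant exposes every column.
                return list(TABLE_COLUMNS[table])
            allowed |= grant.columns
    return [col for col in TABLE_COLUMNS[table] if col in allowed]


engineer_cols = allowed_columns("engineers", "iot_events")
marketing_cols = allowed_columns("marketing", "iot_events")
```

In this model, a group with no matching grant gets an empty column list, which corresponds to having no access to the table at all.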
The Unity Catalog also understands SQL views. This allows you to create SQL views to aggregate data in a complex way. Here is how you can use View-Based Access Control to grant access to only an aggregate version of the data for business_analysts:
CREATE VIEW aggregate_data AS
  SELECT date, country, COUNT(*) AS num_events
  FROM iot_events
  GROUP BY date, country;
GRANT SELECT ON aggregate_data TO business_analysts;
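As a rough illustration of what business_analysts would see through such a view, the sketch below builds the same view definition in an in-memory SQLite database using Python's standard `sqlite3` module. SQLite has no GRANT statement, so only the aggregation is shown; the sample rows and the `device_id` column are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE iot_events (date TEXT, country TEXT, device_id TEXT);

    -- Made-up sample rows, for illustration only.
    INSERT INTO iot_events VALUES
      ('2021-05-01', 'US', 'a'),
      ('2021-05-01', 'US', 'b'),
      ('2021-05-01', 'DE', 'c'),
      ('2021-05-02', 'US', 'a');

    -- Same view definition as in the example above.
    CREATE VIEW aggregate_data AS
      SELECT date, country, COUNT(*) AS num_events
      FROM iot_events
      GROUP BY date, country;
    """
)

# A reader of the view sees per-day, per-country counts, never device_id.
rows = conn.execute(
    "SELECT date, country, num_events FROM aggregate_data ORDER BY date, country"
).fetchall()
view_columns = [
    desc[0] for desc in conn.execute("SELECT * FROM aggregate_data").description
]
```

The view exposes only `date`, `country`, and `num_events`, so granting SELECT on it shares the aggregate without exposing the underlying event rows.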
In addition, the Unity Catalog allows you to set policies across many items at once using attributes (Attribute-Based Access Control), a powerful way to simplify governance at scale. For example, you can tag multiple columns as PII and manage access to all columns tagged as PII in a single rule:
ALTER TABLE iot_events ADD ATTRIBUTE pii ON email;
ALTER TABLE users ADD ATTRIBUTE pii ON phone;
GRANT SELECT ON DATABASE iot_data HAVING ATTRIBUTE NOT IN (pii) TO product_managers;
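The effect of such an attribute-based rule can be sketched in a few lines of Python. This is a conceptual model under stated assumptions, not Unity Catalog's implementation: the `SCHEMAS` contents, the `visible_columns` function, and the `user_id` column are hypothetical.

```python
# Column tags, mirroring the two ALTER TABLE statements above.
COLUMN_TAGS = {
    ("iot_events", "email"): {"pii"},
    ("users", "phone"): {"pii"},
}

# Assumed schemas for the tables in the iot_data database (hypothetical).
SCHEMAS = {
    "iot_events": ["date", "country", "email"],
    "users": ["user_id", "phone"],
}


def visible_columns(excluded_tags):
    """Return, per table, the columns that carry none of the excluded tags."""
    result = {}
    for table, columns in SCHEMAS.items():
        result[table] = [
            col
            for col in columns
            if not (COLUMN_TAGS.get((table, col), set()) & excluded_tags)
        ]
    return result


# What a rule like "HAVING ATTRIBUTE NOT IN (pii)" would expose in this model.
product_manager_view = visible_columns({"pii"})
```

The point of the design is that the rule is evaluated against tags rather than column names: tagging another column as `pii` later removes it from the result without anyone editing the grant.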
Finally, the same attribute system lets you easily govern MLflow models and other objects in a consistent way with your raw data:
GRANT EXECUTE ON MODELS HAVING ATTRIBUTE (eu_data) TO eu_product_managers
Discover and govern data assets in the UI
Unity Catalog’s UI makes it easy to discover, describe, audit and govern data assets in one place. Data stewards can set or review all permissions visually, and the catalog captures audit and lineage information that shows you how each data asset was produced and accessed. The UI is designed for collaboration so that data users can document each asset and see who uses it.
Share data across organizations with Delta Sharing
Every organization needs to share data with customers, partners and suppliers to collaborate. Unity Catalog implements the open source Delta Sharing standard to let you securely share data across organizations, regardless of which computing platform or cloud they run on (any Delta Sharing client can connect to the data).
Open interfaces for easy access
Unity Catalog works with your existing catalogs, data, storage and computing systems so you can leverage your existing investments and build a future-proof governance model. It can mount existing data in Apache Hive Metastores or cloud storage systems such as S3, ADLS and GCS without moving it. It also connects with governance platforms like Privacera and Immuta to let you define custom workflows for managing access to data. Finally, we designed Unity Catalog so that you can also access it from computing platforms other than Databricks: ODBC/JDBC interfaces and high-throughput access via Delta Sharing allow you to securely query your data from any computing system.
As shared in our keynote today, we’re very excited to begin the preview of the Unity Catalog shortly. You can already sign up to join our waitlist. We look forward to your feedback!