Announcing the General Availability of Row and Column Level Security with Databricks Unity Catalog

Provide Security for Your Tables with Row Filters and Column Masks

Published: August 5, 2024

Product12 min read

by Daniel Tenedorio, Ramesh Chandra, Vitalii Li, Todd Greenstein and Kelly Albano

Do you want to try row filters out right now? Head over to the Catalog Explorer, search for the table you want to filter, and then add a Row filter in the Overview tab. We recommend getting started by applying this simple SQL user-defined function as highlighted in our documentation:

We are excited to announce the general availability of Row Filters and Column Masks in Unity Catalog on AWS, Azure, and GCP! Managing fine-grained access controls on rows and columns in tables is critical to ensure data security and meet compliance. With Unity Catalog, you can use standard SQL functions to define row filters and column masks, allowing fine-grained access controls on rows and columns. Row Filters let you control which subsets of your tables' rows are visible to hierarchies of groups and users within your organization. Column Masks let you redact your table values based on the same dimensions.

"Unity Catalog allowed us to create a unified view of our data estate, simplifying collaboration across teams within BlackBerry. We now have a standard approach to manage access permissions and audit files or tables in our lake, with the ability to define fine-grained access controls on rows and columns. Automated data lineage helped us see where the data is coming from to pinpoint the source of a potential threat and to understand which research projects or teams are leveraging the data for threat detection."
— Justin Lai, Distinguished Data Architect, Blackberry

This blog discusses how you can enable fine-grained access controls using Row Filters and Column Masks.

What is Coarse-Grained Object-Level Security?

Before this announcement, Unity Catalog already supported object-level security. For example, you can use GRANT and REVOKE SQL commands over securable objects such as tables and functions to adjust which users and groups are allowed to inspect, query, or modify them:

We can grant read access to the accounts_team:

Now, the accounts_team has access to query (but not modify) the purchase_history table.

Prior Approaches for Sharing Subsets of Data with Different Groups

But what if we have separate accounts teams for different regions?

One strategy uses dynamic views. You can define a view specifically intended for consumption by specific user(s) or group(s):

This involves no data copying, but users still have to remember to query the accounts.purchase_history_emea table if they are in the EMEA region or the accounts.purchase_history_apac table if they are in the APAC region, and so on.

Dynamic views from an admin perspective have a valid value proposition for certain cases. However, for this example, some constraints apply:

Limited to queries; cannot insert or update data within views
Must create and maintain numerous views for each region
Shared SQL logic is cumbersome to reuse across different regional teams
Causes clutter in the Catalog Explorer

In addition to the above, dynamic views do not provide any protection from downstream users discovering values of rows scanned from referenced tables and then filtered or aggregated out within the view. For example, users could craft WHERE clauses that throw errors in response to certain column values with the intention of the query optimizer pushing these operations down within the view evaluation itself.

As a last resort, we could instead create a daily job to copy subsets of data into different tables and set their permissions accordingly:

While this approach effectively addresses query needs, it comes with drawbacks. By duplicating data, we increase storage and compute usage. Also, the duplicated data lags behind the original, introducing staleness. Moreover, this solution caters solely to queries due to restricted user permissions, limiting write access to the primary table.

Introducing Row Filters

With row filters, you can apply predicates to a table, ensuring that only rows meeting specific criteria are returned in subsequent queries.

Each row filter is implemented as a SQL user-defined function (UDF). To begin, write a SQL UDF with a boolean result whose parameter type(s) are the same as the column(s) of your target table that you want to control access by.

For consistency, let's continue using the region column of the previous accounts.purchase_history table for this purpose.

We can test this logic by performing a few queries over the target table and applying the function directly. For someone in the accounts_team_emea group, such a query might look like this:

Or for someone in the admin group who is setting the access control logic in the first place, we find that all rows from the table are returned:

Now we're ready to apply this logic to our target table as a policy function, and grant read access to the accounts_team_emea group:

Or, we can assign this policy directly to the table at creation time to make sure there is no period where the table exists, but the policy does not yet apply:

After that, querying from the table should return the subsets of rows corresponding to the results of our testing above. For example, the accounts_team_emea members will receive the following result:

But what if we want to write new data to the table? This would not be possible if accounts.purchase_history was a dynamic view, but since it is a table with a row filter, we can simply use SQL to insert new rows as needed:

Now, we can share the same accounts.purchase_history table with different groups without copying the data or adding many new names into our namespace.

You can view this information on the Catalog Explorer. Looking at the purchase_history table, we see that a row filter applies:

Clicking on the row filter, we can see the policy function name:

Following the "view" button reveals the function contents:

Introducing Column Masks

We have demonstrated how to create and apply fine-grained access controls to tables using row filters, selectively filtering out rows that the invoking user does not have access to read at query time. But what if we want to control access to columns instead, eliding some column values and leaving others intact within each row?

Announcing column masks!

Each column mask is also implemented as a SQL user-defined function (UDF). However, unlike row filter functions returning boolean results, each column mask policy function accepts one argument and returns the same type as this input argument. For example, we can create column masks to filter out PII in email addresses with policies like this:

For our running accounts.purchase_history table here, let's go ahead and mask out the purchase amount column when the value is more than one thousand:

Now, only administrators have permission to look at the purchase amounts of $10 or greater.

Let's go ahead and test the policy function. Non-admin users see this:

But administrators have access to view all the data:

Looks great! Let's apply the mask to our table:

After that, querying from the table should redact specific column values corresponding to the results of our testing above. For example, non-administrators will receive the following result:

It works correctly.

We can also inspect the values of other columns to make our masking decision. For example, we can modify the function to look at the region column instead of the purchase amount:

Now we can apply the mask with the USING COLUMNS clause to specify the additional column name(s) to pass into the policy function:

Thereafter, querying from the table should redact certain column values differently for non-administrators:

We can look at the mask by looking at the table column in the Catalog Explorer:

Like before, following the "view" button reveals the function contents:

Storing Access Control Lists in Mapping Tables

Row filter and column mask policy functions almost always need to refer to the current user and compare it against a list of allowed users or check its group memberships against an explicit list of allowed groups. Listing these user and group allowlists in the policy functions themselves works well for lists of reasonable sizes. For larger lists or cases where we would prefer extra assurance that the identities of the users or groups themselves are hidden from view for users, we can take advantage of mapping tables instead.

These mapping tables act like personalized gatekeepers, deciding which data rows users or groups can access in your original table. The beauty of mapping tables lies in their seamless integration with fact tables, making your data security strategy more effective.

This approach is a game-changer for various custom requirements:

Tailored User Access: You can impose restrictions based on individual user profiles while accommodating specific rules for user groups. This ensures that each user sees only what they should.
Handling Complex Hierarchies: Whether it's intricate organizational structures or diverse sets of rules, mapping tables can navigate the complexities, ensuring that data access adheres to your unique hierarchy.
Seamless External Model Replication: Replicating complex security models from external source systems becomes a breeze. Mapping tables help you mirror these intricate setups without breaking a sweat.

For example:

Now, we can extend the accounts.purchase_history_groups table to large numbers of groups without making the policy function itself complex, and also restrict access to the rows of that table to only the administrator that created the accounts.purchase_history_row_filter SQL UDF.

Using Row and Column Level Security with Lakehouse Federation

With Lakehouse Federation, Unity Catalog solves critical data management challenges to simplify how organizations handle disparate data systems. It allows you to create a unified view of your entire data estate, structured and unstructured, enabling secure access and exploration for all users regardless of data source. It allows efficient querying and data combination through a single engine, accelerating various data analysis and AI applications without requiring data ingestion. Additionally, it provides a consistent permission model for data security, applying access rules and ensuring compliance across different platforms.

The fine-grained access controls announced here work seamlessly with Lakehouse Federation tables to support sharing access to federated tables within your organizations with custom row and column-level access policies for different groups. There is no need to copy data or create many duplicate or similar table/view names in your catalogs.

For example, you can create a federated connection to an existing MySQL database. Then, browse the Catalog Explorer to inspect the foreign catalog:

Inside the catalog, we find a mysql_demo_nyc_pizza_rating table:

Let's apply our row filter to that table:

Looking at the table overview afterward, it reflects the change:

Clicking on the row filter reveals the name of the function, just like before:

Now, queries over this federated MySQL table will return different subsets of rows depending on each invoking user's identity and group memberships. We've successfully integrated fine-grained access control with Lakehouse Federation, resulting in simplified usability and unified governance for Delta Lake and MySQL tables in the same organization.

Getting started with Row and Column Level Security

With Row Filters and Column Masks, you now gain the power to streamline your data management, eliminating excessive ETL pipelines and data copies. This is your gateway to a new world of unified data security, where you can confidently share data with multiple users and groups while maintaining control and ensuring that sensitive information remains protected.

To get started with Row Filters and Column Masks, check out our documentation on AWS, Azure, and GCP. You can run workloads that query tables with Row Filters and Column Masks on any UC Compute: Serverless, shared access mode, Single-user access mode (from DBR 15.4 onwards)

What's next?

November 20, 2024/4 min read

Introducing Predictive Optimization for Statistics

November 21, 2024/3 min read