An Automated Guide to Distributed and Decentralized Management of Unity Catalog

How Terraform can enable Unity Catalog deployment at scale for different governance models

Published: December 7, 2022

by Vuong Nguyen, Zeashan Pappa and Mattia Zeni

Unity Catalog provides a unified governance solution for all data and AI assets in your lakehouse on any cloud. As customers adopt Unity Catalog, they want to do this programmatically and automatically, using an infrastructure-as-code approach. With Unity Catalog, there is a single metastore per region, which is the top-level container of objects in Unity Catalog. It stores data assets (tables and views) and the permissions that govern access.

This presents a new challenge for organizations that do not have centralized platform/governance teams to own the Unity Catalog management function. Specifically, teams within these organizations now have to share a single metastore while still governing access and performing auditing in complete isolation from each other.

In this blog post, we will discuss how customers can leverage the support for Unity Catalog objects in the Databricks Terraform provider to manage a distributed governance pattern on the lakehouse effectively.

We present two solutions:

One that completely delegates responsibilities to teams when it comes to creating assets in Unity Catalog
One that limits which resources teams can create in Unity Catalog

Creating a Unity Catalog metastore

As a one-off bootstrap activity, customers need to create a Unity Catalog metastore per region they operate in. This requires an account administrator, which is a highly privileged role that is only accessed in break-glass scenarios, i.e. the username & password are stored in a secret vault that requires approval workflows before they can be used in automated pipelines.

An account administrator needs to authenticate using their username & password on AWS:
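A minimal sketch of such a provider configuration, assuming the AWS account console host and hypothetical variable names whose values would be injected from the secret vault. Check the provider's authentication documentation for the methods supported by your provider version.

```hcl
# Hypothetical variable names; values are expected to come from the secret vault.
variable "databricks_account_id" {
  type = string
}

variable "databricks_account_username" {
  type = string
}

variable "databricks_account_password" {
  type      = string
  sensitive = true
}

# Account-level provider for Databricks on AWS.
provider "databricks" {
  host       = "https://accounts.cloud.databricks.com"
  account_id = var.databricks_account_id
  username   = var.databricks_account_username
  password   = var.databricks_account_password
}
```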

Or using their AAD token on Azure:
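A minimal sketch for Azure, assuming the account admin has already signed in with the Azure CLI (`az login`) so that the provider can obtain an AAD token. The variable name is hypothetical.

```hcl
# Hypothetical variable name for the Databricks account ID.
variable "databricks_account_id" {
  type = string
}

# Account-level provider for Azure Databricks, using the Azure CLI session.
provider "databricks" {
  host       = "https://accounts.azuredatabricks.net"
  account_id = var.databricks_account_id
  auth_type  = "azure-cli"
}
```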

The Databricks Account Admin needs to provide:

1. A single cloud storage location (S3/ADLS), which will be the default location to store data for managed tables
2. A single IAM role / managed identity, which Unity Catalog will use to access the cloud storage in (1)

The Terraform code will be similar to below (AWS example)
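A minimal sketch of the metastore and its default data access configuration on AWS. The resource labels, metastore name, and variable names are illustrative assumptions, and the S3 bucket and IAM role are assumed to exist already.

```hcl
# Hypothetical inputs: the bucket and IAM role provided by the account admin.
variable "metastore_bucket" {
  type = string
}

variable "metastore_role_arn" {
  type = string
}

variable "region" {
  type = string
}

resource "databricks_metastore" "this" {
  name         = "primary" # illustrative name
  storage_root = "s3://${var.metastore_bucket}/metastore"
  owner        = "unity_admin_group"

  # Expected by recent provider versions when using an account-level provider;
  # omit it if your provider version does not support the argument.
  region = var.region
}

# Default credential that Unity Catalog uses to reach the storage root.
resource "databricks_metastore_data_access" "this" {
  metastore_id = databricks_metastore.this.id
  name         = "metastore-default-access" # illustrative name
  is_default   = true

  aws_iam_role {
    role_arn = var.metastore_role_arn
  }
}
```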

Teams can choose not to use this default location and identity for their tables by setting a location and identity for managed tables per individual catalog, or even more fine-grained at the schema level. When managed tables are created, the data will then be stored using the schema location (if present) falling back to the catalog location (if present), and only fall back to the metastore location if the prior two locations have not been set.
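As a sketch of this override, the `storage_root` argument can be set on a catalog and on a schema. The names and the bucket variable below are hypothetical, and the paths are assumed to be covered by an external location the team is allowed to use.

```hcl
# Hypothetical bucket owned by the team.
variable "team_bucket" {
  type = string
}

# Managed tables in this catalog default to the catalog location
# instead of the metastore location.
resource "databricks_catalog" "team_x" {
  name         = "team_x"
  storage_root = "s3://${var.team_bucket}/managed"
}

# A schema-level location takes precedence over the catalog location.
resource "databricks_schema" "sales" {
  catalog_name = databricks_catalog.team_x.name
  name         = "sales"
  storage_root = "s3://${var.team_bucket}/managed/sales"
}

# No storage_root here: managed tables fall back to the catalog location.
resource "databricks_schema" "staging" {
  catalog_name = databricks_catalog.team_x.name
  name         = "staging"
}
```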

Nominating a metastore administrator

When creating a metastore, we nominated the unity_admin_group as the metastore administrator. To avoid having a central authority that can list and manage access to all objects in the metastore, we will keep this group empty.

Users can be added to the group for exceptional break-glass scenarios which require a high-powered admin (e.g., setting up initial access, or changing ownership of a catalog if the catalog owner leaves the organization).
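A minimal sketch of the empty administrator group, assuming an account-level provider. The commented-out membership and its variable are hypothetical and show what a temporary break-glass change could look like.

```hcl
# The metastore owner group, deliberately declared without members.
resource "databricks_group" "unity_admin_group" {
  display_name = "unity_admin_group"
}

# Break-glass only: add a member temporarily, then remove it again.
# var.break_glass_user_id is a hypothetical input holding the user's ID.
#
# resource "databricks_group_member" "break_glass" {
#   group_id  = databricks_group.unity_admin_group.id
#   member_id = var.break_glass_user_id
# }
```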

Delegating Responsibilities to Teams

Each team is responsible for creating its own catalogs and managing access to its data. Initial bootstrap activities are required for each new team to get the necessary privileges to operate independently.

The account admin then needs to perform the following:

Create a group called team-admins
Grant CREATE CATALOG, CREATE EXTERNAL LOCATION and, if using Delta Sharing, optionally CREATE SHARE, CREATE PROVIDER and CREATE RECIPIENT to this group

When a new team onboards, place the trusted team admins in the team-admins group

Members of the team-admins group can now easily create new catalogs and external locations for their team without interaction from the account administrator or metastore administrator.
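The bootstrap steps above can be sketched as follows. This assumes the group is created with an account-level provider and the grants are applied with a workspace-level provider attached to the metastore; the `metastore_id` variable is hypothetical. Note that `databricks_grants` manages the complete set of grants on the securable it targets.

```hcl
# Hypothetical input: ID of the metastore created during bootstrap.
variable "metastore_id" {
  type = string
}

resource "databricks_group" "team_admins" {
  display_name = "team-admins"
}

resource "databricks_grants" "metastore" {
  metastore = var.metastore_id

  grant {
    principal = databricks_group.team_admins.display_name
    privileges = [
      "CREATE_CATALOG",
      "CREATE_EXTERNAL_LOCATION",
      # Optional, only when Delta Sharing is used:
      # "CREATE_SHARE",
      # "CREATE_PROVIDER",
      # "CREATE_RECIPIENT",
    ]
  }
}
```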

Onboarding new teams

During the process of adding a new team to Databricks, initial activities from an account administrator are required so that the new team is free to set up their workspaces / data assets to their preference:

A new workspace is created either by team X admins (Azure) or the account admin (AWS)
Account admin attaches the existing metastore to the workspace
Account admin creates a group specific to this team called 'team_X_admin' which contains the admins for the team to be onboarded.

Account admin creates a storage credential and changes its owner to the 'team_X_admin' group so that the team can use it. If the team admins are trusted in the cloud tenant, they can then control what storage the credential has access to (e.g. any of their own S3 buckets or ADLS storage accounts).

Account admin then assigns the newly created workspace to the UC metastore

Team X admins then create any number of catalogs and external locations as required
- Because team admins are not metastore owners or account admins, they cannot interact with any entities (catalogs/schemas/tables etc) that they do not own, i.e. from other teams.
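The onboarding flow above can be sketched as follows (AWS flavor). All variable names and resource labels are hypothetical. In practice the first three resources would be applied by the account admin and the last two by team X admins from their own configuration, so that the team owns what it creates.

```hcl
# Hypothetical inputs.
variable "workspace_id" {
  type = number
}

variable "metastore_id" {
  type = string
}

variable "team_x_role_arn" {
  type = string
}

variable "team_x_bucket" {
  type = string
}

# --- Applied by the account admin ---

resource "databricks_group" "team_x_admin" {
  display_name = "team_X_admin"
}

resource "databricks_metastore_assignment" "team_x" {
  workspace_id = var.workspace_id
  metastore_id = var.metastore_id
}

resource "databricks_storage_credential" "team_x" {
  name  = "team_x_credential"
  owner = databricks_group.team_x_admin.display_name

  aws_iam_role {
    role_arn = var.team_x_role_arn
  }

  depends_on = [databricks_metastore_assignment.team_x]
}

# --- Applied by team X admins ---

resource "databricks_external_location" "team_x" {
  name            = "team_x_location"
  url             = "s3://${var.team_x_bucket}/data"
  credential_name = databricks_storage_credential.team_x.name
}

resource "databricks_catalog" "team_x" {
  name = "team_x"
}
```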

Limited delegation of responsibilities to teams

Some organizations may not want to make teams autonomous in creating assets in their central metastore. In fact, giving multiple teams the ability to create such assets can be difficult to govern: naming conventions cannot be enforced and keeping the environment clean is hard.

In such a scenario, we suggest a model where each team files a request with a list of assets they want admins to create for them. The team will be made owner of the assets so they can be autonomous in assigning permissions to others.

To automate such requests as much as possible, we present how this is done using a CI/CD pipeline. The admin team owns a central repository in their preferred version control system where they keep all the scripts that deploy Databricks in their organization. Each team is allowed to create branches on this repository to add the Terraform configuration files for their own environments using a predefined template (Terraform module). When the team is ready, they create a pull request. At this point, the central admin has to review the pull request (this can also be automated with the appropriate checks) and merge it to the main branch, which will trigger the deployment of the resources for the team.
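As an illustration of what a team might add on its branch, here is a sketch of a module call. The file path, module source, and every input name are hypothetical; the actual template is whatever the central admin team publishes.

```hcl
# teams/team_x.tf (hypothetical path in the central repository)
module "team_x" {
  # Hypothetical location of the predefined template maintained by the admin team.
  source = "../modules/team-onboarding"

  # Hypothetical inputs exposed by the template.
  team_name            = "team_x"
  location             = "westeurope"
  storage_account_name = "stteamxlakehouse"
  admin_group          = "team_X_admins"
}
```

Because teams only fill in inputs, the admin team can enforce naming conventions inside the module, and the review can focus on the values requested.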

This approach gives more control over what individual teams do, but it involves some (limited, automatable) activities for the central admin team.

In this scenario, the Terraform scripts below are executed automatically by the CI/CD pipelines using a Service Principal (00000000-0000-0000-0000-000000000000), which is made account admin. The one-off operation of making this service principal an account admin must be manually executed by an existing account admin, for example:
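A minimal sketch of that one-off operation, assuming an account-level provider authenticated as an existing account admin. The resource labels and display name are illustrative, and the `account_admin` role name should be checked against the provider documentation for your version.

```hcl
# Register the CI/CD service principal in the Databricks account.
resource "databricks_service_principal" "cicd" {
  application_id = "00000000-0000-0000-0000-000000000000"
  display_name   = "cicd-account-admin" # illustrative name
}

# Grant it the account admin role.
resource "databricks_service_principal_role" "cicd_account_admin" {
  service_principal_id = databricks_service_principal.cicd.id
  role                 = "account_admin"
}
```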

Onboarding new teams

When a new team wants to be onboarded, they need to file a request that will perform the following steps (Azure example):

Create a group called team_X_admins, which contains the Account Admin Service Principal (to allow future modifications to the assets) plus the team's admins

Create a new resource group or specify an existing one

Create a Premium Databricks workspace

Create a new Storage Account or provide an existing one

Create a new Container in the Storage Account or provide an existing one

Create a Databricks Access Connector

Assign the "Storage Blob Data Contributor" role to the Access Connector

Assign the central metastore to the newly created Workspace

Create a storage credential

Create an external location

Create a catalog

Once these objects are created the team is autonomous in developing the project, giving access to other team members and/or partners if necessary.
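The objects listed above can be sketched as follows. This is an illustration under stated assumptions, not the actual template: all names and inputs are hypothetical, the azurerm provider and a databricks provider pointed at the new workspace are assumed to be configured by the caller, argument names follow the azurerm 3.x provider, and creation of the team_X_admins group is left out (its name is passed in as `admin_group`).

```hcl
# Hypothetical inputs.
variable "team_name" { type = string }
variable "location" { type = string }
variable "storage_account_name" { type = string }
variable "metastore_id" { type = string }
variable "admin_group" { type = string } # e.g. "team_X_admins"

resource "azurerm_resource_group" "this" {
  name     = "rg-${var.team_name}"
  location = var.location
}

resource "azurerm_databricks_workspace" "this" {
  name                = "dbw-${var.team_name}"
  resource_group_name = azurerm_resource_group.this.name
  location            = azurerm_resource_group.this.location
  sku                 = "premium"
}

resource "azurerm_storage_account" "this" {
  name                     = var.storage_account_name
  resource_group_name      = azurerm_resource_group.this.name
  location                 = azurerm_resource_group.this.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = true # hierarchical namespace (ADLS Gen2)
}

resource "azurerm_storage_container" "this" {
  name                  = "unity"
  storage_account_name  = azurerm_storage_account.this.name
  container_access_type = "private"
}

resource "azurerm_databricks_access_connector" "this" {
  name                = "ac-${var.team_name}"
  resource_group_name = azurerm_resource_group.this.name
  location            = azurerm_resource_group.this.location

  identity {
    type = "SystemAssigned"
  }
}

# Let the connector's managed identity read and write the storage account.
resource "azurerm_role_assignment" "this" {
  scope                = azurerm_storage_account.this.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = azurerm_databricks_access_connector.this.identity[0].principal_id
}

resource "databricks_metastore_assignment" "this" {
  workspace_id = azurerm_databricks_workspace.this.workspace_id
  metastore_id = var.metastore_id
}

resource "databricks_storage_credential" "this" {
  name  = "${var.team_name}_credential"
  owner = var.admin_group

  azure_managed_identity {
    access_connector_id = azurerm_databricks_access_connector.this.id
  }

  depends_on = [databricks_metastore_assignment.this]
}

resource "databricks_external_location" "this" {
  name            = "${var.team_name}_location"
  url             = "abfss://${azurerm_storage_container.this.name}@${azurerm_storage_account.this.name}.dfs.core.windows.net/"
  credential_name = databricks_storage_credential.this.name
  owner           = var.admin_group

  depends_on = [azurerm_role_assignment.this]
}

# The team group owns the catalog, so the team can grant access on its own.
resource "databricks_catalog" "this" {
  name         = var.team_name
  storage_root = databricks_external_location.this.url
  owner        = var.admin_group
}
```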

Modify assets for existing team

Teams are not allowed to modify assets autonomously in Unity Catalog either. To do this, they can file a new request with the central team by modifying the files they have created and making a new pull request.

This is true also if they need to create new assets such as new storage credentials, external locations and catalogs.

Unity Catalog + Terraform = well-governed lakehouse

Above, we walked through some guidelines on leveraging built-in product features and recommended best practices to handle enablement and ongoing management hurdles for Unity Catalog.

Visit the Unity Catalog documentation [AWS, Azure] and our Unity Catalog Terraform guide [AWS, Azure] to learn more.
