An Automated Guide to Distributed and Decentralized Management of Unity Catalog
Unity Catalog provides a unified governance solution for all data and AI assets in your lakehouse on any cloud. As customers adopt Unity Catalog, they want to do so programmatically and automatically, using an infrastructure-as-code approach. With Unity Catalog, there is a single metastore per region, which is the top-level container of objects in Unity Catalog. It stores data assets (tables and views) and the permissions that govern access to them.
This presents a new challenge for organizations that do not have a centralized platform/governance team to own the Unity Catalog management function. Teams within these organizations now have to share a single metastore while still governing access and performing auditing in complete isolation from each other.
In this blog post, we will discuss how customers can leverage the support for Unity Catalog objects in the Databricks Terraform provider to implement a distributed governance pattern on the lakehouse effectively.
We present two solutions:
- One that completely delegates responsibilities to teams when it comes to creating assets in Unity Catalog
- One that limits which resources teams can create in Unity Catalog
Creating a Unity Catalog metastore
As a one-off bootstrap activity, customers need to create a Unity Catalog metastore for each region they operate in. This requires an account administrator, a highly privileged role that should only be used in break-glass scenarios, i.e., its username and password are stored in a secret vault, and approval workflows are required before they can be used in automated pipelines.
An account administrator needs to authenticate using their username & password on AWS:
provider "databricks" {
host = "https://accounts.cloud.databricks.com"
account_id = var.databricks_account_id
username = var.databricks_account_username
password = var.databricks_account_password
}
Or using their AAD token on Azure:
provider "databricks" {
host = "https://accounts.azuredatabricks.net"
account_id = var.databricks_account_id
auth_type = "azure-cli" # or azure-client-secret or azure-msi
}
The Databricks Account Admin needs to provide:
- A single cloud storage location (S3/ADLS), which will be the default location to store data for managed tables
- A single IAM role / managed identity, which Unity Catalog will use to access the cloud storage location above
The Terraform code will be similar to the following (AWS example):
resource "databricks_metastore" "this" {
name = "primary"
storage_root = var.central_bucket
owner = var.unity_admin_group
force_destroy = true
}
resource "databricks_metastore_data_access" "this" {
metastore_id = databricks_metastore.this.id
name = aws_iam_role.metastore_data_access.name
aws_iam_role {
role_arn = aws_iam_role.metastore_data_access.arn
}
is_default = true
}
Teams can choose not to use this default location and identity for their tables by setting a location and identity for managed tables per individual catalog, or, even more fine-grained, at the schema level. When a managed table is created, its data is stored using the schema location (if present), falling back to the catalog location (if present), and only falling back to the metastore location if neither of the first two has been set.
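For example, a team could override the metastore default at both levels. The following is a minimal sketch of that hierarchy; the catalog, schema, and bucket names are hypothetical:
resource "databricks_catalog" "sandbox" {
  metastore_id = databricks_metastore.this.id
  name         = "sandbox"
  # Managed tables in this catalog are stored here instead of the metastore root
  storage_root = "s3://team-sandbox-bucket/managed"
}

resource "databricks_schema" "things" {
  catalog_name = databricks_catalog.sandbox.name
  name         = "things"
  # Overrides the catalog location for managed tables in this schema
  storage_root = "s3://team-sandbox-bucket/managed/things"
}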
Nominating a metastore administrator
When creating the metastore, we nominated the unity_admin_group as the metastore administrator. To avoid having a central authority that can list and manage access to all objects in the metastore, we will keep this group empty:
resource "databricks_group" "admin_group" {
display_name = var.unity_admin_group
}
Users can be added to the group for exceptional break-glass scenarios that require a highly privileged admin (e.g., setting up initial access, or changing the ownership of a catalog if the catalog owner leaves the organization).
resource "databricks_user" "break_glass" {
for_each = toset(var.break_glass_users)
user_name = each.key
force = true
}
resource "databricks_group_member" "admin_group_member" {
for_each = toset(var.break_glass_users)
group_id = databricks_group.admin_group.id
member_id = databricks_user.break_glass[each.value].id
}
Delegating Responsibilities to Teams
Each team is responsible for creating their own catalogs and managing access to their data. Initial bootstrap activities are required for each new team to get the necessary privileges to operate independently.
The account admin then needs to perform the following:
- Create a group called team-admins
- Grant CREATE CATALOG and CREATE EXTERNAL LOCATION to this group, and optionally CREATE SHARE, CREATE PROVIDER, and CREATE RECIPIENT if using Delta Sharing
resource "databricks_group" "team_admins" {
display_name = "team-admins"
}
resource "databricks_grants" "sandbox" {
metastore = databricks_metastore.this.id
grant {
principal = databricks_group.team_admins.display_name
privileges = ["CREATE_CATALOG", "CREATE_EXTERNAL_LOCATION", "CREATE SHARE", "CREATE PROVIDER", "CREATE RECIPIENT"]
}
}
When a new team onboards, place the trusted team admins in the team-admins group:
resource "databricks_user" "team_admins" {
for_each = toset(var.team_admins)
user_name = each.key
force = true
}
resource "databricks_group_member" "team_admin_group_member" {
for_each = toset(var.team_admins)
group_id = databricks_group.team_admins.id
member_id = databricks_user.team_admins[each.value].id
}
Members of the team-admins group can now easily create new catalogs and external locations for their team without any interaction with the account administrator or metastore administrator.
Onboarding new teams
During the process of adding a new team to Databricks, initial activities from an account administrator are required so that the new team is free to set up their workspaces / data assets to their preference:
- A new workspace is created either by team X admins (Azure) or the account admin (AWS)
- Account admin attaches the existing metastore to the workspace
- Account admin creates a group specific to this team called 'team_X_admins', which contains the admins for the team being onboarded.
resource "databricks_group" "team_X_admins" {
display_name = "team_X_admins"
}
resource "databricks_user" "team_X_admins" {
for_each = toset(var.team_X_admins)
user_name = each.key
force = true
}
resource "databricks_group_member" "team_X_admin_group_member" {
for_each = toset(var.team_X_admins)
group_id = databricks_group.team_X_admins.id
member_id = databricks_user.team_X_admins[each.value].id
}
- Account admin creates a storage credential and changes its owner to the 'team_X_admins' group. If the team admins are trusted in the cloud tenant, they can then control which storage the credential has access to (e.g., any of their own S3 buckets or ADLS storage accounts).
resource "databricks_storage_credential" "external" {
name = "team_X_credential"
azure_managed_identity {
access_connector_id = azurerm_databricks_access_connector.ext_access_connector.id
}
comment = "Managed by TF"
owner = databricks_group.team_X_admins.display_name
}
- Account admin then assigns the newly created workspace to the UC metastore
resource "databricks_metastore_assignment" "this" {
workspace_id = var.databricks_workspace_id
metastore_id = databricks_metastore.this.id
default_catalog_name = "hive_metastore"
}
- Team X admins then create any number of catalogs and external locations as required (see the sketch after this list)
- Because team admins are not metastore owners or account admins, they cannot interact with any entities (catalogs/schemas/tables, etc.) that they do not own, i.e., those belonging to other teams.
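For illustration, once this bootstrap is complete, a team X admin could create an external location and a catalog entirely on their own. The following is a minimal sketch; the container, storage account, and catalog names are hypothetical, and the credential is the one created for the team above:
resource "databricks_external_location" "team_X_location" {
  name            = "team_X_location"
  # Hypothetical ADLS container owned by team X
  url             = "abfss://teamxdata@teamxstorage.dfs.core.windows.net/"
  credential_name = databricks_storage_credential.external.id
  comment         = "Managed by team X"
}

resource "databricks_catalog" "team_X_catalog" {
  name    = "team_X_catalog"
  comment = "Created autonomously by team X admins"
}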
Limited delegation of responsibilities to teams
Some organizations may not want to make teams autonomous in creating assets in their central metastore. Giving multiple teams the ability to create such assets can be difficult to govern: naming conventions cannot be enforced, and keeping the environment tidy is hard.
In such a scenario, we suggest a model where each team files a request with a list of assets they want admins to create for them. The team will be made owner of the assets so they can be autonomous in assigning permissions to others.
To automate such requests as much as possible, we present how this can be done using a CI/CD pipeline. The admin team owns a central repository in their preferred version control system, containing all the scripts that deploy Databricks in their organization. Each team is allowed to create branches on this repository and add the Terraform configuration files for their own environments, using a predefined template (a Terraform module). When the team is ready, they create a pull request. At this point, a central admin reviews the pull request (this can also be automated with the appropriate checks) and merges it to the main branch, which triggers the deployment of the resources for the team.
This approach gives the organization more control over what individual teams do, but it involves some (limited, automatable) work for the central admin team.
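To make this concrete, a team's pull request might add little more than a single module block to the repository. The module name, source path, and input variables below are hypothetical; the module itself would encapsulate resources like those shown in the next section:
module "team_X" {
  # Predefined template (Terraform module) owned by the admin team
  source = "./modules/team-onboarding"

  team_name            = "team_X"
  team_admins          = ["alice@example.com", "bob@example.com"]
  resource_group_name  = "rg-team-x"
  storage_account_name = "teamxstorage"
}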
In this scenario, the Terraform scripts below are executed automatically by the CI/CD pipeline using a service principal (00000000-0000-0000-0000-000000000000) that has been made an account admin. The one-off operation of making the service principal an account admin must be performed manually by an existing account admin, for example:
resource "databricks_service_principal" "sp" {
application_id = "00000000-0000-0000-0000-000000000000"
}
resource "databricks_service_principal_role" "sp_account_admin" {
service_principal_id = databricks_service_principal.sp.id
role = "account admin"
}
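The pipeline can then authenticate as this service principal instead of a named user; on Azure, for example, with client credentials (a sketch, assuming the secret and tenant ID are injected as variables by the CI/CD system):
provider "databricks" {
  host                = "https://accounts.azuredatabricks.net"
  account_id          = var.databricks_account_id
  azure_client_id     = "00000000-0000-0000-0000-000000000000"
  azure_client_secret = var.databricks_sp_secret
  azure_tenant_id     = var.azure_tenant_id
}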
Onboarding new teams
When a new team wants to be onboarded, they need to file a request that will create the following objects (Azure example):
- Create a group called team_X_admins, which contains the Account Admin Service Principal (to allow future modifications to the assets) plus the team's admins
resource "databricks_group" "team_X_admins" {
display_name = "team_X_admins"
}
resource "databricks_user" "team_X_admins" {
for_each = toset(var.team_X_admins)
user_name = each.key
force = true
}
resource "databricks_group_member" "team_X_admin_group_member" {
for_each = toset(var.team_X_admins)
group_id = databricks_group.team_X_admins.id
member_id = databricks_user.team_X_admins[each.value].id
}
data "databricks_service_principal" "service_principal_admin" {
application_id = "00000000-0000-0000-0000-000000000000"
}
resource "databricks_group_member" "service_principal_admin_member" {
group_id = databricks_group.team_X_admins.id
member_id = databricks_service_principal.service_principal_admin.id
}
- Create a new resource group or specify an existing one
resource "azurerm_resource_group" "this" {
name = var.resource_group_name
location = var.resource_group_region
}
- Create a Premium Databricks workspace
resource "azurerm_databricks_workspace" "this" {
name = var.databricks_workspace_name
resource_group_name = azurerm_resource_group.this.name
location = azurerm_resource_group.this.location
sku = "premium"
}
- Create a new Storage Account or provide an existing one
resource "azurerm_storage_account" "this" {
name = var.storage_account_name
resource_group_name = azurerm_resource_group.this.name
location = azurerm_resource_group.this.location
account_tier = "Standard"
account_replication_type = "LRS"
account_kind = "StorageV2"
is_hns_enabled = "true"
}
- Create a new container in the Storage Account or provide an existing one
resource "azurerm_storage_container" "container" {
name = "container"
storage_account_name = azurerm_storage_account.this.name
container_access_type = "private"
}
- Create a Databricks Access Connector
resource "azurerm_databricks_access_connector" "this" {
name = var.databricks_access_connector_name
resource_group_name = azurerm_resource_group.this.name
location = azurerm_resource_group.this.location
identity {
type = "SystemAssigned"
}
}
- Assign the "Storage blob Data Contributor" role to the Access Connector
resource "azurerm_role_assignment" "this" {
scope = azurerm_storage_account.this.id
role_definition_name = "Storage Blob Data Contributor"
principal_id = azurerm_databricks_access_connector.metastore.identity[0].principal_id
}
- Assign the central metastore to the newly created Workspace
resource "databricks_metastore_assignment" "this" {
metastore_id = databricks_metastore.this.id
workspace_id = azurerm_databricks_workspace.this.workspace_id
}
- Create a storage credential
resource "databricks_storage_credential" "storage_credential" {
name = "mi_credential"
azure_managed_identity {
access_connector_id = azurerm_databricks_access_connector.this.id
}
comment = "Managed identity credential managed by TF"
owner = databricks_group.team_X_admins
}
- Create an external location
resource "databricks_external_location" "external_location" {
name = "external"
url = format("abfss://%s@%s.dfs.core.windows.net/",
"container",
"storageaccountname"
)
credential_name = databricks_storage_credential.storage_credential.id
comment = "Managed by TF"
owner = databricks_group.team_X_admins
depends_on = [
databricks_metastore_assignment.this, databricks_storage_credential.storage_credential
]
}
- Create a catalog
resource "databricks_catalog" "this" {
metastore_id = databricks_metastore.this.id
name = var.databricks_catalog_name
comment = "This catalog is managed by terraform"
owner = databricks_group.team_X_admins
storage_root = format("abfss://%s@%s.dfs.core.windows.net/managed_catalog",
"container",
"storageaccountname"
)
}
Once these objects are created, the team is autonomous in developing the project, granting access to other team members and/or partners as necessary.
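For example, because the team owns its catalog, team admins can grant access to other principals themselves, without involving the central team. A minimal sketch, assuming a hypothetical team_X_data_consumers group:
resource "databricks_grants" "team_X_catalog_grants" {
  catalog = databricks_catalog.this.name
  grant {
    principal  = "team_X_data_consumers" # hypothetical consumer group
    privileges = ["USE_CATALOG", "USE_SCHEMA", "SELECT"]
  }
}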
Modifying assets for an existing team
Teams are also not allowed to modify assets in Unity Catalog autonomously. To do so, they file a new request with the central team by modifying the files they have created and opening a new pull request.
The same applies if they need to create new assets such as storage credentials, external locations, and catalogs.
Unity Catalog + Terraform = well-governed lakehouse
Above, we walked through some guidelines on leveraging built-in product features and recommended best practices to handle enablement and ongoing management hurdles for Unity Catalog.
Visit the Unity Catalog documentation [AWS, Azure] and our Unity Catalog Terraform guide [AWS, Azure] to learn more.