In our previous blog, we discussed the practical challenges of scaling out a data platform across multiple teams, and how a lack of automation adversely affects innovation and slows down go-to-market. Enterprises need consistent, scalable solutions that use repeatable templates to comply with enterprise governance policies, with the goal of bootstrapping unified data analytics environments across data teams. With Microsoft Azure Databricks, we've taken an API-first approach for all objects, which enables quick provisioning and bootstrapping of cloud computing data environments by integrating with existing enterprise DevOps tooling, without requiring customers to reinvent the wheel. In this article, we will walk through such a cloud deployment automation process using different Azure Databricks APIs.
The process for configuring an Azure Databricks data environment looks like the following:
To accomplish the above, we will be using APIs for the following IaaS features or capabilities available as part of Azure Databricks:
There are a few options available to use the Azure Databricks APIs:
To keep things simple, we'll use the Postman approach below.
Please go ahead and pre-create an Azure resource group. We will be deploying the Azure Databricks workspace in a customer-managed virtual network (VNET). VNET pre-creation is optional. Please refer to this guide to understand VNET requirements.
We will be using an Azure Service Principal to automate the deployment process. Please create a service principal using this guide, generate a new client secret, and make sure to note down the following details:
Navigate to the Azure resource group where you plan to deploy the Azure Databricks workspace and add the "Contributor" role to your service principal.
We will be using the Azure Databricks ARM REST API option to provision a workspace. This is not to be confused with the REST API for different objects within a workspace.
Download the Postman collection from here.
The collection consists of several sections
The environment config file is already imported into Postman. Please go ahead and edit it by clicking on the "gear" button.
Configure the environment as per your settings:
| Variable Name | Value | Description |
| --- | --- | --- |
| Azure subscription details | | |
| tenantId | Azure Tenant ID | Locate it here |
| subscriptionId | Azure Subscription ID | Locate it here |
| clientCredential | Service Principal Secret | |
| clientId | Service Principal ID | |
| resourceGroup | Resource group name | User-defined resource group |
| Constants used | | |
| managementResource | https://management.core.windows.net/ | Constant, more details here |
| databricksResourceId | 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d | Constant, unique applicationId that identifies the Azure Databricks workspace resource inside Azure |
| Azure Databricks deployment via ARM template specific variables | | |
| workspaceName | Ex: adb-dev-workspace | Unique name given to the Azure Databricks workspace |
| VNETCidr | Ex: 11.139.13.0/24 | More details here |
| VNETName | Ex: adb-VNET | Unique name given to the VNET where Azure Databricks is deployed. If the VNET already exists it is used; otherwise a new one is created. |
| publicSubnetName | Ex: adb-dev-pub-sub | Unique name given to the public subnet within the VNET where Azure Databricks is deployed. We highly recommend that you let the ARM template create this subnet rather than pre-creating it. |
| publicSubnetCidr | Ex: 11.139.13.64/26 | More details here |
| privateSubnetName | Ex: adb-dev-pvt-sub | Unique name given to the private subnet within the VNET where Azure Databricks is deployed. We highly recommend that you let the ARM template create this subnet rather than pre-creating it. |
| privateSubnetCidr | Ex: 11.139.13.128/26 | More details here |
| nsgName | Ex: adb-dev-workspace-nsg | Network Security Group attached to the Azure Databricks subnets |
| pricingTier | premium | Either premium or standard; more details here. The IP access list feature requires the premium tier. |
| Workspace tags | | |
| tag1 | Ex: dept101 | Demonstrates how to set tags on the Azure Databricks workspace |
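Before submitting the deployment, it can help to sanity-check the CIDR values entered above. The following is a minimal sketch, not part of the Postman collection: the helper name `validate_subnets` is hypothetical, and it only checks that each subnet lies inside the VNET range and that the subnets do not overlap. It does not check any Azure Databricks specific sizing requirements.

```python
import ipaddress


def validate_subnets(vnet_cidr, subnet_cidrs):
    """Return a list of human-readable problems; an empty list means the layout looks consistent."""
    vnet = ipaddress.ip_network(vnet_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    problems = []

    # Every subnet must sit inside the VNET address space.
    for subnet in subnets:
        if not subnet.subnet_of(vnet):
            problems.append(f"{subnet} is outside {vnet}")

    # The public and private subnets must not share addresses.
    for index, first in enumerate(subnets):
        for second in subnets[index + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")

    return problems


# Example values taken from the table above.
print(validate_subnets("11.139.13.0/24", ["11.139.13.64/26", "11.139.13.128/26"]))
```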
We will be using an Azure AD access token to deploy the workspace, utilizing the OAuth Client Credential workflow. This workflow is also referred to as two-legged OAuth and accesses web-hosted resources by using the identity of an application. This type of grant is commonly used for server-to-server interactions that must run in the background, without immediate interaction with a user.
Executing the "aad token for management resource" API call returns an AAD access token, which is used to deploy the Azure Databricks workspace and to retrieve the deployment status. The access token is valid for 599 seconds by default. If you run into token expiry issues, rerun this API call to regenerate the access token.
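For readers who prefer a script over Postman, the client credentials request can be sketched in Python as follows. This is an illustration under stated assumptions rather than what the collection does internally: it targets the Azure AD v1 token endpoint (`/oauth2/token`), the function names are hypothetical, and the argument names mirror the environment variables from the table above.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(tenant_id, client_id, client_secret, resource):
    """Return (url, form-encoded body) for an OAuth client credentials grant."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": resource,  # e.g. the managementResource constant
    }).encode("utf-8")
    return url, body


def fetch_access_token(tenant_id, client_id, client_secret, resource):
    """POST the request and return the access_token field of the JSON response."""
    url, body = build_token_request(tenant_id, client_id, client_secret, resource)
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["access_token"]
```

Calling `fetch_access_token(tenantId, clientId, clientCredential, "https://management.core.windows.net/")` would yield the management token. Passing `databricksResourceId` as the resource instead would yield a token for the Azure Databricks resource.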
An ARM template is used to deploy the Azure Databricks workspace. The template is sent as the request body payload in the "provision databricks workspace" step inside the Provisioning Workspace section, as highlighted above.
If the subnets specified in the ARM template already exist, they are used; otherwise they are created for you. The Azure Databricks workspace is deployed within your VNET, and a default Network Security Group is created and attached to the subnets used by the workspace.
Workspace deployment takes approximately 5-8 minutes. Executing the "get deployment status and workspace url" call returns the workspace URL, which we'll use in subsequent calls.
Inside the test step, we extract this value from the response and store it in a global variable called "workspaceUrl". We use this global variable in subsequent API calls.
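The deployment and status calls can be sketched outside Postman as below. Several details are assumptions and not taken from the collection: the `api-version` value, the use of `Incremental` mode, the helper names, and reading the URL from `properties.workspaceUrl` of the workspace resource. Note that ARM requests go to `https://management.azure.com`, while `managementResource` is only the audience of the token.

```python
import json

ARM_ENDPOINT = "https://management.azure.com"


def build_deployment_request(subscription_id, resource_group, deployment_name,
                             template, parameters, api_version="2020-06-01"):
    """Return (url, json_body) for a PUT that starts an ARM template deployment."""
    url = (f"{ARM_ENDPOINT}/subscriptions/{subscription_id}"
           f"/resourcegroups/{resource_group}"
           f"/providers/Microsoft.Resources/deployments/{deployment_name}"
           f"?api-version={api_version}")
    body = {
        "properties": {
            "mode": "Incremental",
            "template": template,
            # ARM expects each parameter wrapped as {"value": ...}.
            "parameters": {name: {"value": value} for name, value in parameters.items()},
        }
    }
    return url, json.dumps(body)


def provisioning_state(deployment_response):
    """Extract the state (e.g. Running, Succeeded, Failed) from a deployment GET response."""
    return deployment_response["properties"]["provisioningState"]


def workspace_url(workspace_resource):
    """Extract the workspace URL from a workspace resource and make sure it carries a scheme."""
    host = workspace_resource["properties"]["workspaceUrl"]
    return host if host.startswith("https://") else f"https://{host}"
```

A script would typically poll the deployment with GET on the same URL until `provisioning_state` reports `Succeeded`, then store the result of `workspace_url` the way the Postman test step stores "workspaceUrl".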
A note on using Azure Service Principal as an identity in Azure Databricks
Please note that an Azure Service Principal is considered a first-class identity in Azure Databricks and as such can invoke all of the APIs. One thing that sets service principals apart from user identities is that they do not have access to the web application UI, i.e. they cannot log into the workspace web application and perform UI functions the way a typical user like you and me would. Service principals are primarily used to invoke APIs in a headless fashion.
To authenticate and access Azure Databricks REST APIs, we can use either of the following:
In this section we demonstrate usage of both of these tokens
To generate an AAD token for the service principal, we'll use the client credentials flow for the AzureDatabricks login application, a resource that is uniquely identified by the ID 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d.
Response contains an AAD access token. We'll set up a global variable "access_token" by extracting this value.
Please note that this AAD access token is different from the one we generated earlier to create the workspace. The AAD token for workspace deployment is issued for the Azure management resource, whereas the AAD access token used to interact with the APIs is issued for the Azure Databricks workspace resource.
To generate an Azure Databricks platform access token for the service principal, we'll authenticate with the access_token generated in the last step.
Executing "generate databricks platform token for service principal" returns a platform access token, which we then store in a global environment variable called sp_pat. To keep things simple, we will use sp_pat for authentication in the rest of the API calls.
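As a rough sketch of that call, the request below targets the Token API (`/api/2.0/token/create`). It assumes the service principal has not yet been added to the workspace as a user, in which case the documented approach is to send the management token and the workspace resource ID in two extra `X-Databricks-Azure-*` headers. The helper names, the default lifetime, and the comment text are illustrative choices, not values from the collection.

```python
import json


def workspace_resource_id(subscription_id, resource_group, workspace_name):
    """Build the Azure resource ID of the Azure Databricks workspace."""
    return (f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
            f"/providers/Microsoft.Databricks/workspaces/{workspace_name}")


def build_create_token_request(workspace_url, databricks_aad_token, management_aad_token,
                               resource_id, lifetime_seconds=3600, comment="automation"):
    """Return (url, headers, json_body) for a POST that creates a platform token."""
    url = f"{workspace_url.rstrip('/')}/api/2.0/token/create"
    headers = {
        # AAD token issued for the Azure Databricks resource (access_token in the text).
        "Authorization": f"Bearer {databricks_aad_token}",
        # AAD token issued for the Azure management resource.
        "X-Databricks-Azure-SP-Management-Token": management_aad_token,
        "X-Databricks-Azure-Workspace-Resource-Id": resource_id,
        "Content-Type": "application/json",
    }
    body = json.dumps({"lifetime_seconds": lifetime_seconds, "comment": comment})
    return url, headers, body
```

The `token_value` field of the JSON response is what the text stores as sp_pat.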
The SCIM API allows you to manage
Azure Databricks supports SCIM or System for Cross-domain Identity Management, an open standard that allows you to automate user provisioning using a REST API and JSON. The Azure Databricks SCIM API follows version 2.0 of the SCIM protocol.
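A user-provisioning call might look like the sketch below. It assumes the SCIM Users endpoint at `/api/2.0/preview/scim/v2/Users` and the standard SCIM 2.0 core User schema. The helper name and the example entitlement are illustrative, and `token` stands for either sp_pat or an AAD token.

```python
import json

USER_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"


def build_create_user_request(workspace_url, token, user_name, entitlements=()):
    """Return (url, headers, json_body) for a POST that adds a user to the workspace."""
    url = f"{workspace_url.rstrip('/')}/api/2.0/preview/scim/v2/Users"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/scim+json",
    }
    body = {"schemas": [USER_SCHEMA], "userName": user_name}
    if entitlements:
        # SCIM multi-valued attributes are lists of {"value": ...} objects.
        body["entitlements"] = [{"value": name} for name in entitlements]
    return url, headers, json.dumps(body)


url, headers, body = build_create_user_request(
    "https://adb-123.4.azuredatabricks.net",  # placeholder workspace URL
    "sp_pat-value",                            # placeholder token
    "new.user@example.com",
    entitlements=["allow-cluster-create"],
)
print(url)
print(body)
```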
Token Management provides Azure Databricks administrators with more insight and control over Personal Access Tokens in their workspaces. Please note that this does not apply to AAD tokens as they are managed within Azure AD.
By monitoring and controlling token creation, you reduce the risk of lost tokens or long-lasting tokens that could lead to data exfiltration from the workspace.
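One way to act on this is to list the workspace tokens and flag the ones that live too long. The sketch below works on the `token_infos` list returned by `GET /api/2.0/token-management/tokens`. The field names (`token_id`, `creation_time`, `expiry_time` in epoch milliseconds, with `-1` meaning no expiry) reflect our understanding of that response and should be verified against the API reference. The helper names and the 90-day threshold are hypothetical.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def long_lived_token_ids(token_infos, max_lifetime_days):
    """Return IDs of tokens that never expire or outlive the allowed lifetime."""
    limit_ms = max_lifetime_days * MS_PER_DAY
    flagged = []
    for info in token_infos:
        expiry = info.get("expiry_time", -1)
        never_expires = expiry == -1
        if never_expires or expiry - info["creation_time"] > limit_ms:
            flagged.append(info["token_id"])
    return flagged


def revoke_token_path(token_id):
    """Path for the DELETE call that revokes a single token."""
    return f"/api/2.0/token-management/tokens/{token_id}"


sample = [
    {"token_id": "a1", "creation_time": 0, "expiry_time": 30 * MS_PER_DAY},
    {"token_id": "b2", "creation_time": 0, "expiry_time": 365 * MS_PER_DAY},
    {"token_id": "c3", "creation_time": 0, "expiry_time": -1},
]
print(long_lived_token_ids(sample, max_lifetime_days=90))
```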
A cluster policy limits the ability to create clusters based on a set of rules. A policy defines those rules as limitations on the attributes used for cluster creation. Cluster policies also carry ACLs that limit their use to specific users and groups. For more details please refer to our blog on cluster policies.
Only admin users can create, edit, and delete policies. Admin users also have access to all policies.
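To make the idea of rules as attribute limitations concrete, here is a minimal sketch of a policy creation request against `/api/2.0/policies/clusters/create`. The policy name and the specific limits are made-up examples, and the helper name is hypothetical. One detail worth noting is that the API expects the `definition` field to be a JSON document serialized as a string.

```python
import json


def build_create_policy_request(workspace_url, token, name, definition):
    """Return (url, headers, json_body) for a POST that creates a cluster policy."""
    url = f"{workspace_url.rstrip('/')}/api/2.0/policies/clusters/create"
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    # The definition is itself JSON, embedded as a string in the request body.
    body = json.dumps({"name": name, "definition": json.dumps(definition)})
    return url, headers, body


# Illustrative rules: force auto-termination and cap the cluster size.
example_definition = {
    "autotermination_minutes": {"type": "fixed", "value": 30, "hidden": True},
    "num_workers": {"type": "range", "maxValue": 10},
}

url, headers, body = build_create_policy_request(
    "https://adb-123.4.azuredatabricks.net",  # placeholder workspace URL
    "sp_pat-value",                            # placeholder token
    "small-clusters-only",
    example_definition,
)
print(body)
```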
The Clusters Permission API lets you set permissions for users and groups on clusters (both interactive and job clusters). The same process can be used for Jobs, Pools, Notebooks, Folders, Model Registry and Tokens.
You may have a security policy which mandates that all access to Azure Databricks workspaces goes through your network and web application proxy. Configuring IP Access Lists ensures that employees have to connect via the corporate VPN before accessing a workspace.
This feature provides Azure Databricks admins a way to set an `allowlist` and a `blocklist` of `CIDR / IPs` that can access a workspace.
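Granting a permission on a cluster can be sketched as a PATCH with an access control list. The path below assumes the `/api/2.0/permissions/clusters/{cluster_id}` endpoint, and some workspaces may expose it under a `/preview/` prefix instead. The helper name, cluster ID, and principals are placeholders.

```python
import json


def build_cluster_permission_request(workspace_url, token, cluster_id, grants):
    """Return (url, headers, json_body) for a PATCH that adds cluster permissions.

    grants is a list of (principal_field, principal, permission_level) tuples,
    where principal_field is "user_name" or "group_name".
    """
    url = f"{workspace_url.rstrip('/')}/api/2.0/permissions/clusters/{cluster_id}"
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    acl = [
        {field: principal, "permission_level": level}
        for field, principal, level in grants
    ]
    return url, headers, json.dumps({"access_control_list": acl})


url, headers, body = build_cluster_permission_request(
    "https://adb-123.4.azuredatabricks.net",  # placeholder workspace URL
    "sp_pat-value",                            # placeholder token
    "0123-456789-abcde000",                    # placeholder cluster ID
    [("user_name", "new.user@example.com", "CAN_RESTART"),
     ("group_name", "data-engineers", "CAN_ATTACH_TO")],
)
print(body)
```

PATCH adds to the existing permissions. Swapping the `clusters` path segment for another object type follows the same pattern the text describes for jobs, pools, and notebooks.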
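A sketch of the two calls involved is shown below: one to switch the feature on through the workspace configuration endpoint, and one to create an allow list. It assumes the `/api/2.0/workspace-conf` and `/api/2.0/ip-access-lists` endpoints with `list_type` values `ALLOW` and `BLOCK`. The helper name, label, and addresses are placeholders, and the local validation step is an addition of this sketch.

```python
import ipaddress
import json


def build_ip_access_list_requests(workspace_url, label, list_type, ip_addresses):
    """Return [(method, url, json_body), ...] to enable the feature and add one list."""
    if list_type not in ("ALLOW", "BLOCK"):
        raise ValueError(f"unsupported list_type: {list_type}")
    for entry in ip_addresses:
        # Raises ValueError for anything that is not an IP address or CIDR range.
        ipaddress.ip_network(entry, strict=False)

    base = workspace_url.rstrip("/")
    enable = ("PATCH", f"{base}/api/2.0/workspace-conf",
              json.dumps({"enableIpAccessLists": "true"}))
    create = ("POST", f"{base}/api/2.0/ip-access-lists",
              json.dumps({"label": label, "list_type": list_type,
                          "ip_addresses": list(ip_addresses)}))
    return [enable, create]


for method, url, body in build_ip_access_list_requests(
        "https://adb-123.4.azuredatabricks.net",  # placeholder workspace URL
        "corp-vpn", "ALLOW", ["203.0.113.0/24", "198.51.100.7"]):
    print(method, url, body)
```

Take care that the caller's own address is on the allow list before enabling the feature, otherwise subsequent API calls from that address may be rejected.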
Azure Databricks platform APIs not only enable data teams to provision and secure enterprise grade data platforms but also help automate some of the most mundane but crucial tasks from user onboarding to setting up secure perimeter around these platforms.
As the unified data analytics platform is scaled across data teams, the challenges of workspace provisioning, resource configuration, overall management and compliance with enterprise governance multiply for admins. End-to-end automation is a highly recommended best practice to address these concerns and to improve repeatability and reproducibility across the board.
We want to make workspace administration super simple, so that you get to do more and focus on solving some of the world's toughest data challenges.
Please rerun the "generate aad token for management resource" step to regenerate the management access token. The token has a time to live of 599 seconds.
The Azure Databricks REST API supports a maximum of 30 requests/second per workspace. Requests that exceed the rate limit will receive a 429 response status code.
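Automation scripts that issue many calls in a loop may want to back off when they see a 429. The following is a generic sketch, not something the Postman collection does: `send` is a hypothetical zero-argument callable that performs one request and returns `(status_code, payload)`, and the exponential delay schedule is an arbitrary choice.

```python
import time


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    Waits base_delay, then 2x, 4x, ... between attempts. Returns the last
    (status_code, payload) pair, which is still a 429 if every attempt was throttled.
    """
    status, payload = None, None
    for attempt in range(max_attempts):
        status, payload = send()
        if status != 429:
            return status, payload
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, payload
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without real waiting.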
Common token issues, along with their mitigations, are listed over here.