Databricks provides a managed Apache Spark platform that simplifies running production applications, enables real-time data exploration, and reduces infrastructure complexity. A key piece of that infrastructure is the Apache Hive Metastore, which acts as a data catalog, abstracting away schema and table properties so users can quickly access the data.
The Databricks platform provides a fully managed Hive Metastore that allows users to share a data catalog across multiple Spark clusters. We realize that users may already have a Hive Metastore they would like to integrate with Databricks, so we also support seamless integration with an existing Hive Metastore. This allows Databricks to work alongside existing systems such as EMR, Cloudera, or any other system running a Hive Metastore. This blog outlines the technical details.
Apache Hive Metastore Background
Hive is a component that was added on top of Hadoop to provide SQL capabilities to the big data environment. It began with a Hive client and a Hive Metastore: users executed queries that were compiled into MapReduce jobs. To abstract the underlying data structures and file locations, Hive applies a table schema on top of the data. This schema is stored in a database and managed by the metastore process.
CREATE EXTERNAL TABLE users (name STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3a://my-bucket/users-csv-table/';
CREATE EXTERNAL TABLE users_parquet (name STRING, age INT)
STORED AS PARQUET
LOCATION 's3a://my-bucket/users-parquet-table/';
The above examples show that the metastore abstracts the underlying file types and allows users to interact with the data at a higher level.
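For instance, once the tables above are registered, the same query can be run against either one from a notebook; Spark looks up the location and format from the metastore. A minimal sketch, assuming the tables from the DDL above exist:
%scala
// The same query works against the CSV-backed and the Parquet-backed table;
// the metastore supplies the storage location and file format for each.
spark.sql("SELECT name, age FROM users WHERE age > 21").show()
spark.sql("SELECT name, age FROM users_parquet WHERE age > 21").show()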
Traditionally, Hadoop clusters have one Hive Metastore per cluster, with a cluster composed of Apache HDFS, YARN, Hive, and Spark. The Hive Metastore exposes a metastore proxy service that clients connect to, and the metadata itself is stored in a relational database. Hive supports a variety of backend databases to host the defined schema, including MySQL, PostgreSQL, and Oracle.
Supported Apache Hive Versions
Apache Spark supports multiple versions of Hive, from 0.12 up to 1.2.1, which allows users to connect to an existing metastore to access table definitions. Setting up a central Hive Metastore can be challenging: you have to verify that the correct JARs are loaded, the correct configurations are applied, and a supported version is in use.
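As a quick sanity check, you can inspect from a notebook which Hive metastore client version and JAR source a running cluster is configured with. A small sketch; these are the same keys used in the init script later in this post, and the values shown fall back to Spark's defaults if nothing has been configured:
%scala
// Print the Hive metastore client version and jar source for this cluster.
println(spark.conf.get("spark.sql.hive.metastore.version"))
println(spark.conf.get("spark.sql.hive.metastore.jars"))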
Integration How-To
In the cloud, clusters are viewed as transient compute resources. Customers want the flexibility and elasticity of a cloud environment, where compute resources can be shut down when they are not needed. One component that needs to remain highly available is the Hive Metastore process. There are two ways to integrate with it:
- Connect directly to the backend database
- Configure clusters to connect to the Hive Metastore proxy server
Users follow option #2 when they need to integrate with a legacy system. Note that this carries additional cost in the cloud, because the proxy service has to run 24x7 even though it only acts as a proxy. Option #1 removes that overhead and lets clusters connect directly to the backend database, which can be hosted on AWS as an RDS instance. Option #1 is recommended and discussed below. For instructions on option #2, see the documentation link at the end.
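Before wiring up the init script for option #1, it can be worth confirming that a cluster can actually reach the backend database. The following is a hypothetical connectivity check from a notebook, assuming a MySQL-backed metastore and that the MySQL JDBC driver is available on the cluster; replace the ${...} placeholders with your own values:
%scala
// Hypothetical sanity check: open (and immediately close) a JDBC connection
// to the metastore's backend database. Replace the placeholders before running.
import java.sql.DriverManager

val url = "jdbc:mysql://${mysql-host}:${mysql-port}/${metastore-db}"
val conn = DriverManager.getConnection(url, "${mysql-username}", "${mysql-password}")
try {
  println("Connected to metastore backend: " + !conn.isClosed)
} finally {
  conn.close()
}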
Proposed Configuration Docs
The configuration process uses Databricks-specific tooling: the Databricks File System (DBFS) APIs. These let you create bootstrap scripts for your cluster, read and write to the underlying S3 filesystem, and more. Below are configuration guidelines to help integrate the Databricks environment with your existing Hive Metastore.
The configuration below uses a bootstrap (init) script scoped to a specific cluster name; for example, replace ${cluster-name} with hive-test to test central metastore connectivity. Once tested, you can deploy the init script to the root init directory so that it is applied to every cluster.
Configuration
%scala
dbutils.fs.put(
"/databricks/init/${cluster-name}/external-metastore.sh",
"""#!/bin/sh
|
|# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
|[driver] {
| # Hive-specific configuration options for metastores in local mode.
| # The spark.hadoop prefix is added to make sure these Hive-specific options propagate to the metastore client.
| "spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:mysql://${mysql-host}:${mysql-port}/${metastore-db}"
| "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "com.mysql.jdbc.Driver"
| "spark.hadoop.javax.jdo.option.ConnectionUserName" = "${mysql-username}"
| "spark.hadoop.javax.jdo.option.ConnectionPassword" = "${mysql-password}"
|
| # Spark specific configuration options
| "spark.sql.hive.metastore.version" = "${hive-version}"
| # Skip this one if ${hive-version} is 0.13.x.
| "spark.sql.hive.metastore.jars" = "${hive-jar-source}"
|
| # If any of your table or database use s3 as the file system scheme,
| # please uncomment the next line to set the s3:// URL scheme to S3A file system.
| # spark.hadoop prefix is added to make sure these file system options will
| # propagate to the metastore client and Hadoop configuration.
| # "spark.hadoop.fs.s3.impl" = "com.databricks.s3a.S3AFileSystem"
|
| # If you need to use AssumeRole, please uncomment the following settings.
| # "spark.hadoop.fs.s3a.impl" = "com.databricks.s3a.S3AFileSystem"
| # "spark.hadoop.fs.s3n.impl" = "com.databricks.s3a.S3AFileSystem"
| # "spark.hadoop.fs.s3a.credentialsType" = "AssumeRole"
| # "spark.hadoop.fs.s3a.stsAssumeRole.arn" = "${sts-arn}"
|}
|EOF
|""".stripMargin,
overwrite = true
)
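After running the cell above, you can confirm that the script was written, inspect its contents, and, once the cluster-scoped test passes, copy it to the root init directory so every cluster picks it up. A sketch using the DBFS utilities; adjust the paths to match your cluster name:
%scala
// Confirm the cluster-scoped init script exists and inspect what was written.
display(dbutils.fs.ls("/databricks/init/${cluster-name}/"))
println(dbutils.fs.head("/databricks/init/${cluster-name}/external-metastore.sh"))

// Once verified, promote the script to the global init directory.
dbutils.fs.cp(
  "/databricks/init/${cluster-name}/external-metastore.sh",
  "/databricks/init/external-metastore.sh")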
A Word of Caution
Once deployed globally, this configuration bootstraps every cluster in the environment, so a change on the metastore side can cause an environment-wide outage. Examples include:
- Broken network connectivity
- Changes to the underlying Hive version
- Changes to metastore credentials
- Firewall rules added to the metastore
To temporarily work around such an issue, you can use the Databricks REST API to retrieve the current contents of the bootstrap script and then remove the script while you fix the underlying problem. For example:
import base64, json, requests

# Returns a tuple of credentials for the REST API calls
tCreds = ('user@example.com', 'myStrongPassword')

path_to_script = '/databricks/init/${cluster-name}/external-metastore.sh'
env_url = 'https://yourenv.cloud.databricks.com'

# Helper to pretty print JSON
def pprint_j(i):
    print(json.dumps(json.loads(i), indent=4, sort_keys=True))

# Read example
read_payload = {
    'path': path_to_script
}
resp = requests.get(env_url + '/api/2.0/dbfs/read', auth=tCreds, json=read_payload)
results = resp.content
pprint_j(results)
print(resp.status_code)

# Decode the base64-encoded script contents
print(base64.b64decode(json.loads(results)['data']).decode('utf-8'))

# Delete example
delete_payload = {
    'path': path_to_script,
    'recursive': 'false'
}
resp = requests.post(env_url + '/api/2.0/dbfs/delete', auth=tCreds, json=delete_payload)
results = resp.content
pprint_j(results)
print(resp.status_code)
If you want to configure clusters to connect to the Hive Metastore proxy server, you can find instructions in our Hive Metastore online guide.
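As a rough sketch of that approach (not the full instructions), the cluster points at the metastore's thrift service rather than the backend database, so the javax.jdo.* JDBC settings in the init script above are replaced by a single property. Here ${metastore-host} is a placeholder and 9083 is the default metastore port:
# Remote (thrift) mode: point the metastore client at the proxy service instead of the backend database.
"spark.hadoop.hive.metastore.uris" = "thrift://${metastore-host}:9083"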
What’s Next
Databricks is a very fast and easy way to start coding and analyzing data on Apache Spark. If you would like to get started with Spark and have an existing Hive Metastore you need help integrating, you can get in touch with one of our solution architects through our Contact Us page. To get a head start, sign up for a free trial of Databricks and try out some of the live exercises in the documentation.
If you need help with configuring VPC peering within Databricks, check out our VPC peering documentation.