Announcing Support for Google BigQuery in Databricks Runtime 7.1

July 31, 2020 in Company Blog

At Databricks, we are building a unified platform for data and AI. Data in enterprises lives in many locations, and Databricks excels at unifying data wherever it may reside. Today, we are happy to announce support for reading and writing data in Google BigQuery within Databricks Runtime 7.1.

Introduction to BigQuery

In Google’s own words, “BigQuery is a serverless, highly scalable and cost-effective data warehouse designed for business agility.” BigQuery is a popular choice for analyzing data stored on the Google Cloud Platform. Under the covers, BigQuery is a columnar data warehouse that separates compute from storage. It also supports ANSI SQL:2011, which makes it a useful choice for big data analytics.

Enhancements for Databricks users

The Spark data source included in Databricks Runtime 7.1 is a fork of Google’s open-source spark-bigquery-connector that makes it easy to work with BigQuery from Databricks:

  • Reduced data transfer and faster queries: Databricks automatically pushes down certain query predicates (e.g., filters on nested columns) to BigQuery to speed up query processing and reduce data transfer. These optimizations are applied to your queries automatically; see the sketch after this list.
  • Direct query: Transforming and filtering data residing in a BigQuery table with the existing Spark APIs can require transferring large amounts of data from BigQuery to Databricks first. To reduce data transfer costs, we have added the capability to run a SQL query on BigQuery with the query() API and transfer only the resulting data set.
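
To illustrate predicate pushdown, here is a minimal sketch against the public Shakespeare sample table used in the examples below; exactly which predicates are pushed down depends on the connector and the query.

// the filter below is sent to BigQuery as a query predicate, so only
// matching rows are transferred to Databricks
val words = spark.read.format("bigquery")
  .option("table", "bigquery-public-data.samples.shakespeare")
  .load()
  .filter("word_count > 100")

words.show()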

Examples

The following examples show how easy it is for BigQuery users to get started with Databricks.

Read a BigQuery table or the result of a SQL query into a DataFrame

val table = "bigquery-public-data.samples.shakespeare"
// BigQuery dataset where the connector materializes query results
val tempLocation = "databricks_testing"

// read the entire table into a DataFrame
val df1 = spark.read.format("bigquery").option("table", table).load()

// read the result of a BigQuery SQL query into a DataFrame
val df2 = spark.read.format("bigquery")
  .option("materializationDataset", tempLocation)
  .option("query", s"SELECT count(1) FROM `${table}`")
  .load()

Write a DataFrame to a BigQuery table

// the connector stages the data in the given GCS bucket before loading
// it into BigQuery, so temporaryGcsBucket must name a writable bucket
df.write
  .format("bigquery")
  .mode("append")
  .option("temporaryGcsBucket", tempLocation)
  .option("table", "mycompany.employees")
  .save()
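
If you would rather not repeat the staging bucket on every write, the connector can also read it from the Spark configuration. A minimal sketch, assuming the open-source connector’s temporaryGcsBucket setting:

// set the staging bucket once per session instead of on each write
spark.conf.set("temporaryGcsBucket", tempLocation)

df.write
  .format("bigquery")
  .mode("append")
  .option("table", "mycompany.employees")
  .save()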

Use cases

Support for BigQuery enables new use cases, including these examples that our customers are already building:

  • Advanced analytics and machine learning on data stored in Google Cloud: Take advantage of the power of Databricks’ collaborative data science environment to supercharge the productivity of your data teams. You can also standardize the ML lifecycle from experimentation to production and enable ML and AI on data in Google Cloud.
  • Multi-cloud data integration: If part of your data resides in Google Cloud, you can use Databricks to bring together data silos and unlock the full value of your data.

See the documentation for detailed information on how to get started.