Building Data Applications on the Lakehouse With the Databricks SQL Connector for Python

Published: January 26, 2022

by Bilal Aslam, Ben Fleis, Niall Egan and Shant Hovsepian

We are excited to announce General Availability of the Databricks SQL Connector for Python. This follows the recent General Availability of Databricks SQL on Amazon Web Services and Azure. Python developers can now build data applications on the lakehouse, benefiting from record-setting performance for analytics on all their data.

The native Python connector offers simple installation and a Python DB API 2.0 compatible interface that makes it easy to query data. It also automatically converts between Databricks SQL and Python data types, removing the need for boilerplate code.

In this blog post, we will run through some examples of connecting to Databricks and running queries against a sample dataset.

Simple installation from PyPI

With this native Python connector, there’s no need to download and install ODBC/JDBC drivers. Installation is through pip, which means you can include this connector in your application and use it for CI/CD as well:

Installation requires Python 3.7+

Query tables and views

The connector works with SQL endpoints as well as All Purpose Clusters. In this example, we show you how to connect to and run a query on a SQL endpoint. To establish a connection, we import the connector and pass in connection and authentication information. You can authenticate using a Databricks personal access token (PAT) or a Microsoft Azure active directory (AAD) token.

The following example retrieves a list of trips from the NYC taxi sample dataset and prints the trip distance to the console. cursor.description contains metadata about the result set in the DB-API 2.0 format . cursor.fetchall() fetches all the remaining rows as a Python list.

Output (edited for brevity):

Note: when using parameterized queries, you should carefully sanitize your input to prevent SQL injection attacks.

Insert data into tables

The connector also lets you run INSERT statements, which is useful for inserting small amounts of data (e.g. thousands of rows) generated by your Python app into tables:

Output:

To bulk load large amounts of data (e.g. millions of rows), we recommend first uploading the data to cloud storage and then executing the COPY INTO command.

Query metadata about tables and views

As well as executing SQL queries, the connector makes it easy to see metadata about your catalogs, databases, tables and columns. The following example will retrieve metadata information about columns from a sample table:

Output (edited for brevity):

A bright future for Python app developers on the lakehouse

We would like to thank the contributors to Dropbox’s PyHive connector, which provided the basis for early versions of the Databricks SQL Connector for Python. In the coming months, we plan to open-source the Databricks SQL Connector for Python and begin welcoming contributions from the community.

We are excited about what our customers will build with the Databricks SQL connector for Python. In upcoming releases, we are looking forward to adding support for additional authentication schemes, multi-catalog metadata and SQLAlchemy. Please try out the connector, and give us feedback. We would love to hear from you on what you would like us to support.