In this blog post we introduce Databricks Connect, a new library that allows you to leverage native Apache Spark APIs from any notebook, IDE, or custom application.
Over the last several years, many custom application connectors have been written for Apache Spark. This includes tools like spark-submit, REST job servers, notebook gateways, and so on. These tools are subject to many limitations, including:
Compare this to how you would connect to a SQL database service, which just involves importing a library and connecting to a server:
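As a minimal sketch of that workflow, using a Postgres driver with a hypothetical server, database, and table (any DB-API-compatible driver follows the same import-connect-query pattern):

    # Connect to a database server and run a query -- no cluster tooling needed.
    # (Host, credentials, and table names below are placeholders.)
    import psycopg2

    conn = psycopg2.connect(host="db.example.com", dbname="sales",
                            user="analyst", password="...")
    cur = conn.cursor()
    cur.execute("SELECT order_date, SUM(amount) FROM orders GROUP BY order_date")
    for row in cur.fetchall():
        print(row)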
The equivalent for Spark's structured data APIs would be the following:
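A minimal PySpark sketch of that same pattern (the table and column names are illustrative):

    # Import the library, get a session, and query structured data.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    orders = spark.table("orders")
    orders.groupBy("order_date").agg(F.sum("amount").alias("total")).show()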
However, prior to Databricks Connect, the snippet above would only work with a single-machine Spark deployment, preventing you from easily scaling to multiple machines or to the cloud without extra tools such as spark-submit.
Databricks Connect completes the Spark connector story by providing a universal Spark client library. This enables you to run Spark jobs from notebook apps (e.g., Jupyter, Zeppelin, Colab), IDEs (e.g., Eclipse, PyCharm, IntelliJ, RStudio), and custom Python / Java applications.
This means that anywhere you can "import pyspark" or "import org.apache.spark", you can now seamlessly run large-scale jobs against Databricks clusters. As an example, we show a Colab notebook executing Spark jobs remotely using Databricks Connect. Note that there is no application-specific integration here: we simply installed the databricks-connect library and imported it. The notebook also reads an S3 dataset even though it runs in GCP, which works because the Spark cluster itself is hosted in an AWS region:
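The Spark code in such a notebook is ordinary PySpark. As a rough sketch (the S3 path is a placeholder, and the cluster needs access to the bucket):

    # Once databricks-connect is installed and configured, plain PySpark code
    # in the notebook runs against the remote Databricks cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    events = spark.read.json("s3a://example-bucket/events/*.json")
    print(events.count())   # executed on the Databricks cluster in AWS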
Jobs launched from Databricks Connect run remotely on Databricks clusters to leverage their distributed compute, and can be monitored using the Databricks Spark UI:
More than one hundred customers are actively using Databricks Connect today. Some of the notable use cases include:
- Development & CI/CD
- Interactive analytics
- Application development
To build a universal client library, we had to satisfy two requirements: (1) the client must behave identically to Spark, so that existing applications run unmodified, and (2) physical execution and IO must happen remotely on the Databricks cluster, where the data and distributed compute live.
To meet these requirements, when the application uses Spark APIs, the Databricks Connect library runs the planning of the job all the way up to the analysis phase. This enables the library to behave identically to Spark (requirement 1). When the job is ready to be executed, Databricks Connect sends the logical query plan over to the server, where physical execution and IO occur (requirement 2):
Figure 1. Databricks Connect divides the lifetime of a Spark job into a client phase, which includes everything up to logical analysis, and a server phase, which performs execution on the remote cluster.
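To make the division concrete, here is a sketch of how it appears from the application's side: building the DataFrame only constructs and analyzes a plan locally, while the action at the end triggers execution on the cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Client phase: these calls are planned and analyzed locally;
    # no work is scheduled on the cluster yet.
    plan = (spark.range(0, 10**9)
            .withColumn("bucket", F.col("id") % 10)
            .groupBy("bucket")
            .count())

    # Server phase: the action ships the analyzed plan to the cluster, which
    # performs physical planning, distributed execution, and IO, then streams
    # the results back to the client.
    plan.show()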
The Databricks Connect client is designed to work well across a variety of use cases. It communicates with the server over REST, making authentication and authorization straightforward through platform API tokens. Sessions are isolated between users, allowing secure, high-concurrency sharing of clusters. Results are streamed back in an efficient binary format for high performance. The protocol is stateless, which means you can easily build fault-tolerant applications and won't lose work even if the clusters are restarted.
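As an illustrative sketch of the token-based setup (the environment variable names below follow the Databricks Connect documentation for DBR 5.x and are assumptions here; check the docs for your version):

    import os

    # Point the client at a workspace and cluster using a platform API token.
    # (Illustrative values; names may differ across releases.)
    os.environ["DATABRICKS_ADDRESS"] = "https://mycompany.cloud.databricks.com"
    os.environ["DATABRICKS_API_TOKEN"] = "dapi-..."
    os.environ["DATABRICKS_CLUSTER_ID"] = "0304-201045-abcde123"

    from pyspark.sql import SparkSession

    # Subsequent Spark calls in this process run against the configured cluster.
    spark = SparkSession.builder.getOrCreate()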
Databricks Connect enters general availability starting with the DBR 5.4 release, with support for Python, Scala, Java, and R workloads. You can get it for all languages from PyPI with "pip install databricks-connect", and documentation is available here.
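A minimal quick start looks roughly like the following; the version pin and CLI steps are a sketch of the documented workflow, so consult the documentation for the exact commands for your cluster's DBR release.

    # In a shell:
    #   pip install databricks-connect==5.4.*   # match the cluster's DBR version
    #   databricks-connect configure            # enter workspace URL, API token, cluster ID
    #   databricks-connect test                 # sanity-check connectivity to the cluster

    # Then, in any Python application or notebook:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    print(spark.range(1000).count())   # executed remotely on the Databricks cluster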