As organizations mature their data infrastructure and accumulate more data than ever before in their data lakes, open and reliable table formats such as Delta Lake become a critical necessity.
Thousands of companies are already using Delta Lake in production, and open-sourcing all of Delta Lake (as announced in June 2022) has further increased its adoption across various domains and verticals.
Since many of those companies use both Databricks and other data and AI frameworks (e.g., Power BI, Trino, Flink, Spark on Kubernetes) as part of their tech stack, it’s crucial for them to be able to read from and write to Delta Lake using all of those frameworks.
The goal of this blog post is to help these users do so, as seamlessly as possible.
Databricks provides multiple options to read data from and write data to the lakehouse. These options differ from each other on various parameters, and each matches different use cases.
The parameters we use to evaluate these options are:
Access the files directly on cloud storage. External tables (AWS/Azure/GCP) in Databricks Unity Catalog (UC) can be accessed directly using the path of the table. This requires the client to store the path, have a network path to the storage, and have permission to access the storage directly.
a. Pros
b. Cons
c. Flow (a minimal read sketch follows this list):
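For illustration, here is a minimal sketch of a standalone client reading a Delta table directly by path, using the open-source deltalake (delta-rs) Python package. The bucket path is a hypothetical placeholder, and storage credentials are assumed to be already configured in the environment (e.g., via the standard AWS environment variables).

```python
# A minimal direct-read sketch using the open-source deltalake package.
# The table path below is hypothetical; the client needs network access to
# the storage and credentials configured (e.g., AWS_ACCESS_KEY_ID, etc.).
from deltalake import DeltaTable

table_path = "s3://my-bucket/path/to/delta-table"  # hypothetical external table path

dt = DeltaTable(table_path)
df = dt.to_pandas()  # load the current table snapshot into a pandas DataFrame
print(df.head())
```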
In this scenario, we sync the metadata in Unity Catalog with an external Hive Metastore (HMS), such as AWS Glue, on a regular basis, keeping one or more databases in sync with the external catalog. This allows a client using any Hive-supported reader to access the table. As with the previous solution, it requires the client to have direct access to the storage.
a. Pros
b. Cons
c. Flow (a client-side read sketch follows this list):
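As an illustration, here is a sketch of how a client-side Spark session (e.g., Spark on Kubernetes) might read a synced table through AWS Glue acting as the Hive Metastore. The Glue client factory configuration and the database/table names are assumptions for illustration; the exact setup depends on your environment, and the Glue catalog client must be on the classpath.

```python
# A sketch of an open-source Spark client reading a Delta table whose metadata
# is synced to AWS Glue (acting as the Hive Metastore). Configs are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-synced-delta-table")
    # Delta Lake support for open-source Spark
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Point the Hive client at AWS Glue (assumes the Glue catalog client is on
    # the classpath; this class name is environment-specific)
    .config("spark.hadoop.hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)

# Spark resolves the table location from Glue, then reads the Delta files
# directly from cloud storage (direct storage permissions are still required).
df = spark.table("my_synced_db.my_table")  # hypothetical database.table names
df.show()
```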
Access Delta tables via Delta Sharing (read more about Delta Sharing here).
The data provider creates a share for existing Delta tables, and the data recipient can access the data defined within the share configuration. The shared data is kept up to date and supports real-time/near-real-time use cases, including streaming.
Generally speaking, the data recipient connects to a Delta Sharing server via a Delta Sharing client (a Delta Sharing client is any tool that supports direct reads from a Delta Sharing source; a variety of tools do). The server then provides the client with signed URLs, which the client uses to access the Delta table storage directly and read only the data it is allowed to access.
On the data provider end, this approach removes the need to manage permissions at the storage level and provides certain audit capabilities (at the share level).
On the data recipient end, the data is consumed using one of the aforementioned tools, which means the recipient also needs to handle compute scalability on their own (e.g., using a Spark cluster).
a. Pros
b. Cons
c. Flow (a minimal client sketch follows this list):
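For illustration, a minimal sketch using the open-source delta-sharing Python client. The profile file (a JSON credentials file issued by the data provider) and the share/schema/table names are hypothetical placeholders.

```python
# A minimal Delta Sharing read sketch using the open-source delta-sharing package.
# The profile path and share/schema/table names below are hypothetical.
import delta_sharing

profile = "/path/to/config.share"  # credentials file issued by the data provider
table_url = profile + "#my_share.my_schema.my_table"

# Small tables: load the full table into pandas on the client.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())

# Larger tables: load as a Spark DataFrame instead (requires a Spark cluster with
# the delta-sharing connector) -- compute scalability is the recipient's job.
# df = delta_sharing.load_as_spark(table_url)
```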
The JDBC/ODBC connector allows you to connect your backend application to a Databricks SQL warehouse over JDBC/ODBC (as described here).
This is essentially no different from what you’d normally do when connecting a backend application to a database.
Databricks and some third-party developers provide wrappers for the JDBC/ODBC connector that allow direct access from various environments, including:
This solution is suitable for standalone clients, as the computing power is the Databricks SQL warehouse (hence compute scalability is handled by Databricks).
Unlike the Delta Sharing approach, the JDBC/ODBC connector also allows you to write data to Delta tables (it even supports concurrent writes).
a. Pros
b. Cons
c. Flow (a connection sketch follows the notes below):
d. Note that if you have Unity Catalog enabled on your workspace, you also get full governance and auditing of the operations. You can still use the approach described above without Unity Catalog, but governance and auditing will be limited.
e. This is the only option that supports row-level filtering and column filtering.
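As an illustration, here is a sketch using the databricks-sql-connector Python package (one of the aforementioned wrappers). The hostname, HTTP path, access token, and table name are placeholders you would replace with your own.

```python
# A minimal read/write sketch against a Databricks SQL warehouse using the
# databricks-sql-connector package. All connection values below are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        # Write: the SQL warehouse supplies the compute and handles concurrent writes
        cursor.execute(
            "INSERT INTO my_catalog.my_schema.my_table VALUES (1, 'hello')"
        )
        # Read back a sample of the table
        cursor.execute("SELECT * FROM my_catalog.my_schema.my_table LIMIT 10")
        for row in cursor.fetchall():
            print(row)
```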
This chart shows how well each of the solution alternatives described above matches a select list of common use cases, rated 0-4:
Review Documentation
https://docs.databricks.com/integrations/jdbc-odbc-bi.html
https://www.databricks.com/product/delta-sharing
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-sync.html