Databricks Labs
Databricks Labs are projects created by the field team to help customers get their use cases into production faster!
DQX
Simplified data quality checking at scale for PySpark workloads, on both streaming and batch DataFrames.
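For illustration, here is a minimal sketch of a metadata-driven check with the databricks-labs-dqx package; the names (DQEngine, apply_checks_by_metadata, is_not_null) follow the project's documentation but should be verified against the version you install, and the input table is hypothetical.

```python
# Minimal DQX sketch (assumes databricks-labs-dqx is installed and a
# Spark session is available; verify API names against your version).
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# Checks are declared as metadata; failing rows are flagged, not dropped.
checks = [
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"col_name": "customer_id"}},
    },
]

df = spark.read.table("main.demo.customers")  # hypothetical input table
annotated_df = dq_engine.apply_checks_by_metadata(df, checks)
```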
UCX
UCX is a toolkit for enabling Unity Catalog (UC) in your Databricks workspace. UCX provides commands and workflows to migrate tables and views to UC, and it can rewrite dashboards, jobs, and notebooks to use the migrated data assets in UC, among many other features.
Mosaic
Mosaic is a tool that simplifies the implementation of scalable geospatial data pipelines by binding together common open source geospatial libraries and Apache Spark™. Mosaic also provides a set of examples and best practices for common geospatial use cases. It provides APIs for ST_ expressions and GRID_ expressions, supporting grid index systems such as H3 and British National Grid.
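As a sketch of those APIs, the snippet below indexes longitude/latitude points to H3 cells using Mosaic's Python bindings; grid_pointascellid and st_point follow Mosaic's documented GRID_/ST_ expressions, but the input data is hypothetical and names should be checked against your Mosaic version.

```python
# Minimal Mosaic sketch: index points to H3 cells (assumes a Databricks
# cluster with Mosaic installed; verify names against your version).
from pyspark.sql import functions as F
import mosaic as mos

mos.enable_mosaic(spark, dbutils)  # registers the ST_/GRID_ expressions

points = spark.createDataFrame([(-0.1276, 51.5072)], ["lon", "lat"])  # hypothetical data
indexed = points.withColumn(
    "h3_cell",
    mos.grid_pointascellid(mos.st_point(F.col("lon"), F.col("lat")), F.lit(9)),
)
```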
Other Projects
Overwatch
Analyze all of your jobs and clusters across all of your workspaces to quickly identify where you can make the biggest adjustments for performance gains and cost savings.
Splunk Integration
Add-on for Splunk, an app that allows Splunk Enterprise and Splunk Cloud users to run queries and execute actions, such as running notebooks and jobs, in Databricks.
Smolder
Smolder provides an Apache Spark™ SQL data source for loading EHR data from HL7v2 message formats. Additionally, Smolder provides helper functions that can be used on a Spark SQL DataFrame to parse HL7 message text, and to extract segments, fields, and subfields from a message.
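A minimal sketch of that data source, following the "hl7" format name from the project README; the path is hypothetical.

```python
# Minimal Smolder sketch: load HL7v2 messages via the "hl7" data source
# (assumes the Smolder JAR is attached to the cluster; path is hypothetical).
df = spark.read.format("hl7").load("/mnt/raw/hl7/")
df.printSchema()  # message header plus an array of parsed segments
```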
Geoscan
Apache Spark ML Estimator for density-based spatial clustering based on Hexagonal Hierarchical Spatial Indices.
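A minimal sketch of fitting the estimator, with setter names (setEpsilon, setMinPts) taken from the project README; the column names and values are illustrative.

```python
# Minimal Geoscan sketch (assumes the Geoscan library is installed;
# verify setter names against your version). Epsilon is in meters.
from geoscan import Geoscan

geoscan = (
    Geoscan()
    .setLatitudeCol("latitude")
    .setLongitudeCol("longitude")
    .setPredictionCol("cluster")
    .setEpsilon(200)
    .setMinPts(20)
)
model = geoscan.fit(points_df)  # points_df: a hypothetical DataFrame of lat/lon points
clustered = model.transform(points_df)
```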
Migrate
Tool to help customers migrate artifacts between Databricks workspaces. It allows customers to export configurations and code artifacts as a backup or as part of a migration to a different workspace.
Data Generator
Generate relevant data quickly for your projects. The Databricks data generator can be used to generate large simulated/synthetic data sets for testing, POCs, and other uses.
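A minimal sketch with the dbldatagen package; the column names, ranges, and row counts below are illustrative.

```python
# Minimal dbldatagen sketch: build a million-row synthetic DataFrame
# (assumes dbldatagen is installed; the schema below is illustrative).
import dbldatagen as dg

spec = (
    dg.DataGenerator(spark, rows=1_000_000, partitions=8)
    .withColumn("customer_id", "long", uniqueValues=100_000)
    .withColumn("plan", "string", values=["basic", "pro", "enterprise"])
    .withColumn("spend", "decimal(10,2)", minValue=0, maxValue=1000, random=True)
)
df = spec.build()
```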
DeltaOMS
Centralized Delta transaction log collection for metadata and operational metrics analysis on your Lakehouse.
DLT-META
This framework makes it easy to ingest data using Delta Live Tables and metadata. With DLT-META, a single data engineer can easily manage thousands of tables. Several Databricks customers run DLT-META in production to process more than 1,000 tables.
DiscoverX
DiscoverX automates administration tasks that require inspecting or applying operations to a large number of Lakehouse assets.
brickster
{brickster} is the R toolkit for Databricks. It includes:
- Wrappers for Databricks APIs (e.g., db_cluster_list, db_volume_read)
- Browsing of workspace assets via the RStudio Connections Pane (open_workspace())
- Access to databricks-sql-connector via {reticulate}
- An interactive Databricks REPL
DBX
This tool simplifies the job launch and deployment process across multiple environments. It also helps package your project and deliver it to your Databricks environment in a versioned fashion. Designed in a CLI-first manner, it is built to be actively used both inside CI/CD pipelines and as part of local tooling for fast prototyping.
Tempo
The purpose of this project is to provide an API for manipulating time series on top of Apache Spark™. Functionality includes featurization using lagged time values, rolling statistics (mean, sum, count, etc.), AS OF joins, and downsampling and interpolation. It has been tested at terabyte scale on historical data.
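A minimal sketch of an AS OF join with Tempo's TSDF wrapper; the input DataFrames and column names are illustrative.

```python
# Minimal Tempo sketch: AS OF join of trades against quotes (assumes the
# Tempo package is installed; trades_df/quotes_df are hypothetical
# DataFrames with an event_ts timestamp and a symbol partition column).
from tempo import TSDF

trades_tsdf = TSDF(trades_df, ts_col="event_ts", partition_cols=["symbol"])
quotes_tsdf = TSDF(quotes_df, ts_col="event_ts", partition_cols=["symbol"])

# For each trade, attach the latest quote at or before the trade's timestamp.
joined = trades_tsdf.asofJoin(quotes_tsdf, right_prefix="quote")
```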
PyLint Plugin
This plugin extends PyLint with checks for common mistakes and issues in Python code written for the Databricks environment.
PyTester
PyTester is a powerful way to manage test setup and teardown in Python. This library provides a set of fixtures to help you write integration tests for Databricks.
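A minimal sketch of an integration test built on those fixtures; the fixture names (ws, make_schema, make_table) follow the project's documentation, and the assertion is illustrative.

```python
# Minimal PyTester sketch: fixtures create (and clean up) Unity Catalog
# objects around the test (assumes databricks-labs-pytester is installed
# and workspace auth is configured; verify fixture names against your version).
def test_creates_managed_table(ws, make_schema, make_table):
    schema = make_schema()                       # temporary schema, auto-dropped
    table = make_table(schema_name=schema.name)  # temporary table in that schema
    assert ws.tables.get(table.full_name).name == table.name
```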
Delta Sharing Java Connector
The Java connector follows the Delta Sharing protocol to read shared tables from a Delta Sharing server. To reduce and limit egress costs on the Data Provider side, the connector implements a persistent cache that removes unnecessary reads.
Please note that all projects in the https://github.com/databrickslabs account are provided for your exploration only, and are not formally supported by Databricks with service level agreements (SLAs). They are provided AS IS and we do not make any guarantees of any kind. Any issues discovered through the use of these projects can be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for GitHub support. If you are a customer with a current Databricks Support Services contract, you may submit a support ticket relating to issues arising from the use of these projects, request how-to assistance, and request help triaging the root cause of such issues. Project issues found to originate with Databricks Platform Services will be handled per the Databricks Support Policy. For issues determined to originate with the project, Databricks will in its sole discretion provide such support as it deems reasonable and appropriate.