Bringing unprecedented reliability and performance to cloud data lakes
Designed by Databricks in collaboration with Microsoft, Azure Databricks combines the best of Databricks’ Apache Spark™-based cloud service and Microsoft Azure. The service provides the Databricks Unified Analytics Platform integrated with the Azure cloud platform, encompassing the Azure Portal, Azure Active Directory, and other data services on Azure, including Azure SQL Data Warehouse, Azure Cosmos DB, Azure Data Lake Storage, and Microsoft Power BI.
Databricks Delta, a component of Azure Databricks, addresses the reliability and performance challenges of data lakes by bringing unprecedented data reliability and query performance to the cloud. It is a unified data management system that delivers ML readiness for both batch and streaming data at scale while simplifying the underlying data analytics architecture.
Further, it is easy to port existing code to Delta. With today’s public preview, Azure Databricks Premium customers can start using Delta straight away and benefit from the acceleration that large, reliable datasets bring to their ML efforts. Others can try it out with the 14-day Azure Databricks trial.
Common Data Lake Challenges
Many organizations have responded to their ever-growing data volumes by adopting data lakes as places to collect their data ahead of making it available for analysis. While this has tended to improve the situation somewhat, data lakes also present some key challenges:
Query performance - The required ETL processes can add significant latency, so it may take hours before incoming data is reflected in query results, and users never see the freshest data. Further, as data volumes grow, query run times can become unacceptably long.
Data reliability - Complex data pipelines are error-prone and consume inordinate resources. Evolving schemas as business needs change is effort-intensive, and errors or gaps in incoming data, a not uncommon occurrence, can cause failures in downstream applications.
System complexity - It is difficult to build flexible data engineering pipelines that combine streaming and batch analytics; doing so requires complex, low-level code. Intervening in a stream with batch corrections, or running multiple streams from the same sources or to the same destinations, is severely restricted.
Databricks Delta To The Rescue
Databricks Delta has already been in use by several customers as part of a private preview, handling more than 300 billion rows and more than 100 TB of data per day. Today we are excited to announce that it is entering Public Preview for Microsoft Azure Databricks Premium customers, expanding its reach to many more.
Using an innovative new table design, Delta supports both batch and streaming use cases with high query performance and strong data reliability while requiring a simpler data pipeline architecture:
Increased query performance - Delta can deliver 10 to 100 times faster performance than Apache Spark™ on Parquet through key enablers such as compaction, flexible indexing, multi-dimensional clustering and data caching.
Improved data reliability - Delta provides ACID (“all or nothing”) transactions, schema validation and enforcement, exactly-once semantics, snapshot isolation, and support for UPSERTS and DELETES.
Reduced system complexity - Delta unifies batch and streaming in a common pipeline architecture; operating on the same table also shortens the time from data ingestion to query result, as sketched below. Schema evolution lets Delta infer the schema from input data, making it easier to deal with changing business needs.
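To make the unified batch-and-streaming pattern concrete, here is a minimal PySpark sketch; the table path, rate source and checkpoint location are illustrative assumptions, not examples from this post:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Batch write: create (or append to) a Delta table at an illustrative path.
batch_events = spark.range(0, 1000).withColumn("value", col("id") * 2)
batch_events.write.format("delta").mode("append").save("/delta/events")

# Streaming write: append to the *same* table; Delta's transaction log keeps
# concurrent readers and writers consistent.
stream_events = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
                 .withColumnRenamed("value", "id")
                 .withColumn("value", col("id") * 2)
                 .select("id", "value"))

query = (stream_events.writeStream.format("delta")
         .option("checkpointLocation", "/delta/events/_checkpoint")
         .outputMode("append")
         .start("/delta/events"))

# A batch query over the same path reflects both batch and streaming writes.
print(spark.read.format("delta").load("/delta/events").count())
```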
The Versatility of Delta
Delta can be deployed to help address a myriad of use cases including IoT, clickstream analytics and cyber security. Indeed, some of our customers are already finding value with Delta for these use cases, and I hope to share more on that in future posts. My colleagues have also written a blog post, Simplify Streaming Stock Data Analysis Using Databricks Delta, that showcases Delta and that you might find interesting.
Easy to Adopt: Check Out Delta Today
Porting existing Spark code to use Delta is as simple as changing
“CREATE TABLE ... USING parquet” to
“CREATE TABLE ... USING delta”
or changing
“dataframe.write.format("parquet").save("/data/events")” to
“dataframe.write.format("delta").save("/data/events")”
If you are already using Azure Databricks Premium you can explore Delta today using:
- Azure Databricks Delta Quickstart for an introduction to Databricks Delta
- Optimizing Performance and Cost for a discussion of features such as compaction, z-ordering and data skipping.
Both of these contain notebooks in Python, Scala and SQL that you can use to try Delta.
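For a taste of what the performance material covers, here is a minimal sketch of compaction and Z-Ordering issued through spark.sql; the table and column names are illustrative assumptions, and the notebooks linked above are the authoritative reference for these commands:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact many small files in the table into fewer, larger ones
# (the table name "events" is an illustrative assumption).
spark.sql("OPTIMIZE events")

# Z-Order related columns so data skipping can prune files when
# queries filter on them.
spark.sql("OPTIMIZE events ZORDER BY (eventType, eventDate)")
```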
If you are not already using Databricks, you can try Databricks Delta by signing up for the free 14-day Azure Databricks trial.
You can learn more about Delta from the Databricks Delta documentation.
Visit the Delta Lake online hub to learn more, download the latest code and join the Delta Lake community.