Engineering blog

Announcing Delta Lake 3.0 with New Universal Format and Liquid Clustering

We are excited to announce Delta Lake 3.0, the next major release of the Linux Foundation open source Delta Lake Project, available in...
Platform blog

Faster MERGE Performance With Low-Shuffle MERGE and Photon

At Databricks, one of our key goals is to provide our customers with an industry-best price/performance experience out of the box. From ETL...
Engineering blog

Get Your Free Copy of Delta Lake: The Definitive Guide (Early Release)

At the Data + AI Summit, we were thrilled to announce the early release of Delta Lake: The Definitive Guide, published by...
Engineering blog

Automatically Evolve Your Nested Column Schema, Stream From a Delta Table Version, and Check Your Constraints

We recently announced the release of Delta Lake 0.8.0, which introduces schema evolution and performance improvements in merge and operational metrics in...
Engineering blog

Diving Into Delta Lake: DML Internals (Update, Delete, Merge)

September 29, 2020 by Tathagata Das and Brenner Heintz in Engineering Blog
In previous blogs, Diving Into Delta Lake: Unpacking The Transaction Log and Diving Into Delta Lake: Schema Enforcement & Evolution, we described...
Engineering blog

Enabling Spark SQL DDL and DML in Delta Lake on Apache Spark 3.0

Get an early preview of O'Reilly's new ebook for the step-by-step guidance you need to start using Delta Lake. Last week, we had...
Engineering blog

Schema Evolution in Merge Operations and Operational Metrics in Delta Lake

Try this notebook to...
Engineering blog

Query Delta Lake Tables from Presto and Athena, Improved Operations Concurrency, and Merge performance

January 29, 2020 by Tathagata Das and Denny Lee in Engineering Blog
We are excited to...
Engineering blog

Simple, Reliable Upserts and Deletes on Delta Lake Tables using Python APIs

October 3, 2019 by Tathagata Das and Denny Lee in Engineering Blog
We are excited to announce the release of Delta Lake 0.4.0 which introduces Python APIs for manipulating and managing data in Delta tables...
Engineering blog

Announcing the Delta Lake 0.3.0 Release

We are excited to...
Company blog

Efficient Upserts into Data Lakes with Databricks Delta

Simplify building big data...
Company blog

Introducing Databricks Optimized Autoscaling on Apache Spark™

Databricks is thrilled to announce our new optimized autoscaling feature. The new Apache Spark™-aware resource manager leverages Spark shuffle and executor statistics to...
Engineering blog

Introducing Low-latency Continuous Processing Mode in Structured Streaming in Apache Spark 2.3

Structured Streaming in Apache Spark 2.0 decoupled micro-batch processing from its high-level APIs for a couple of reasons...
Engineering blog

Introducing Stream-Stream Joins in Apache Spark 2.3

Since we introduced Structured Streaming in Apache Spark 2.0, it has supported joins (inner joins and some types of outer joins) between...
Company blog

Do your Streaming ETL at Scale with Apache Spark’s Structured Streaming

September 1, 2017 by Tathagata Das in Company Blog
At the Spark Summit in San Francisco in June, we announced that Apache Spark’s Structured Streaming is marked as production-ready and shared...
Engineering blog

Event-time Aggregation and Watermarking in Apache Spark’s Structured Streaming

This is the fourth post in a multi-part series about how you can perform complex streaming analytics using Apache Spark. Continuous applications often...
Engineering blog

Working with Complex Data Formats with Structured Streaming in Apache Spark 2.1

In part 1 of this series on Structured Streaming blog posts, we demonstrated how easy it is to write an end-to-end streaming ETL...
Engineering blog

Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1

Explore why lakehouses are the data architecture of the future with the father of the data warehouse, Bill Inmon. Try this notebook in...
Engineering blog

Spark Structured Streaming

Apache Spark 2.0 adds the first version of a new higher-level API, Structured Streaming, for building continuous applications. The main goal is...
Engineering blog

Faster Stateful Stream Processing in Apache Spark Streaming

February 1, 2016 by Tathagata Das and Shixiong Zhu in Engineering Blog
Many complex stream processing pipelines must maintain state across a period of time. For example, if you are interested in understanding user behavior...
Engineering blog

Diving into Apache Spark Streaming's Execution Model

With so many distributed stream processing engines available, people often ask us about the unique benefits of Apache Spark Streaming. From early...
Engineering blog

New Visualizations for Understanding Apache Spark Streaming Applications

Earlier, we presented new visualizations introduced in Apache Spark 1.4.0 to understand the behavior of Spark applications. Continuing the theme, this blog highlights...
Engineering blog

Improvements to Kafka integration of Spark Streaming

Apache Kafka is rapidly becoming one of the most popular open source stream ingestion platforms. We see the same trend among the users...
Engineering blog

Improved Fault-tolerance and Zero Data Loss in Apache Spark Streaming

January 15, 2015 by Tathagata Das in Engineering Blog
Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system. Since its...
Engineering blog

Apache Spark 1.1: The State of Spark Streaming

With Apache Spark 1.1 recently released, we’d like to take this occasion to feature one of the most popular Spark components - Spark...
Engineering blog

Apache Spark 0.9.1 Released

April 9, 2014 by Tathagata Das in Engineering Blog
We are happy to announce the availability of Apache Spark 0.9.1! This is a maintenance release with bug fixes, performance improvements, better...