Data Engineering | Databricks Blog

Announcing the State Reader API: The New "Statestore" Data Source

March 28, 2024 by Craig Lukasik and Jungtaek Lim in Engineering

Databricks Runtime 14.3 includes a new capability that allows users to access and analyze Structured Streaming's internal state data: the State Reader...

PySpark in 2023: A Year in Review

March 25, 2024 by Hyukjin Kwon, Takuya Ueshin, Allison Wang, Ruifeng Zheng, Xinrong Meng, Haejoon Lee and Amanda Liu in Industries

With the releases of Apache Spark 3.4 and 3.5 in 2023, we focused heavily on improving PySpark performance, flexibility, and ease of use...

Simplify PySpark testing with DataFrame equality functions

March 6, 2024 by Haejoon Lee, Allison Wang and Amanda Liu in Engineering

The DataFrame equality test functions were introduced in Apache Spark™ 3.5 and Databricks Runtime 14.2 to simplify PySpark unit testing. The full set...

A Deep Dive into the Latest Performance Improvements of Stateful Pipelines in Apache Spark Structured Streaming

February 28, 2024 by Mojgan Mazouchi, Mrityunjay Kumar, Anish Shrigondekar and Karthikeyan Ramasamy in Engineering

This post is the second part of our two-part series on the latest performance improvements of stateful pipelines. The first part of this...

Performance Improvements for Stateful Pipelines in Apache Spark Structured Streaming

February 27, 2024 by Mojgan Mazouchi, Mrityunjay Kumar, Anish Shrigondekar and Karthikeyan Ramasamy in Engineering

Apache Spark™ Structured Streaming is a popular open-source stream processing platform that provides scalability and fault tolerance, built on top of the...

Lakehouse Monitoring: A Unified Solution for Quality of Data and AI

December 12, 2023 by Jacqueline Li, Alkis Polyzotis and Kasey Uhlenhuth in Platform

Databricks Lakehouse Monitoring allows you to monitor all your data pipelines – from data to features to ML models – without additional...

Python Dependency Management in Spark Connect

November 13, 2023 by Hyukjin Kwon and Ruifeng Zheng in Engineering

Managing the environment of an application in a distributed computing environment can be challenging. Ensuring that all nodes have the necessary environment to...

Named Arguments for SQL Functions

November 13, 2023 by Daniel Tenedorio, Xinyi Yu, Allison Wang, Wenchen Fan, Serge Rielau and Richard Yu in Engineering

Today, we introduce named arguments for SQL functions. With this feature, you can invoke functions in more flexible ways...

Introducing Python User-Defined Table Functions (UDTFs)

November 7, 2023 by Allison Wang, Daniel Tenedorio, Takuya Ueshin and Allan Folting in Engineering

Apache Spark™ 3.5 and Databricks Runtime 14.0 have brought an exciting feature to the table: Python user-defined table functions (UDTFs). In this blog...

Arrow-optimized Python UDFs in Apache Spark™ 3.5

November 6, 2023 by Xinrong Meng, Hyukjin Kwon, Takuya Ueshin and Allan Folting in Engineering

In Apache Spark™, Python User-Defined Functions (UDFs) are among the most popular features. They empower users to craft custom code tailored to their...