Skip to main content
Page 1
Industries category icon 1

PySpark in 2023: A Year in Review

With the releases of Apache Spark 3.4 and 3.5 in 2023, we focused heavily on improving PySpark performance, flexibility, and ease of use...
Engineering blog

Arrow-optimized Python UDFs in Apache Spark™ 3.5

In Apache Spark™, Python User-Defined Functions (UDFs) are among the most popular features. They empower users to craft custom code tailored to their...
Engineering blog

Introducing Apache Spark™ 3.4 for Databricks Runtime 13.0

Today, we are happy to announce the availability of Apache Spark™ 3.4 on Databricks as part of Databricks Runtime 13.0 . We extend...
Engineering blog

Memory Profiling in PySpark

There are many factors in a PySpark program's performance. PySpark supports various profiling tools to expose tight loops of your program and allow...
Engineering blog

How to Profile PySpark

In Apache Spark™, declarative Python APIs are supported for big data workloads. They are powerful enough to handle most common use cases. Furthermore...
Engineering blog

Pandas API on Upcoming Apache Spark™ 3.2

October 4, 2021 by Hyukjin Kwon and Xinrong Meng in Engineering Blog
We're thrilled to announce that the pandas API will be part of the upcoming Apache Spark™ 3.2 release. pandas is a powerful, flexible...
Engineering blog

Benchmark: Koalas (PySpark) and Dask

Koalas is a data science library that implements the pandas APIs on top of Apache Spark so data scientists can use their favorite...
Engineering blog

Python Autocomplete Improvements for Databricks Notebooks

At Databricks, we strive to provide a world-class development experience for data scientists and engineers, and new features are constantly getting added to...