PySpark in 2023: A Year in Review

With the releases of Apache Spark 3.4 and 3.5 in 2023, we focused heavily on improving PySpark performance, flexibility, and ease of use...

Arrow-optimized Python UDFs in Apache Spark™ 3.5

In Apache Spark™, Python User-Defined Functions (UDFs) are among the most popular features. They empower users to craft custom code tailored to their...

Introducing Apache Spark™ 3.4 for Databricks Runtime 13.0

Today, we are happy to announce the availability of Apache Spark™ 3.4 on Databricks as part of Databricks Runtime 13.0. We extend...

Memory Profiling in PySpark

There are many factors in a PySpark program's performance. PySpark supports various profiling tools to expose tight loops of your program and allow...

How to Profile PySpark

In Apache Spark™, declarative Python APIs are supported for big data workloads. They are powerful enough to handle most common use cases. Furthermore...

Pandas API on Upcoming Apache Spark™ 3.2

October 4, 2021 by Hyukjin Kwon and Xinrong Meng
We're thrilled to announce that the pandas API will be part of the upcoming Apache Spark™ 3.2 release. pandas is a powerful, flexible...

Benchmark: Koalas (PySpark) and Dask

Koalas is a data science library that implements the pandas APIs on top of Apache Spark so data scientists can use their favorite...

Python Autocomplete Improvements for Databricks Notebooks

At Databricks, we strive to provide a world-class development experience for data scientists and engineers, and new features are constantly getting added to...