With the releases of Apache Spark 3.4 and 3.5 in 2023, we focused heavily on improving PySpark performance, flexibility, and ease of use. This blog post walks you through the key improvements.
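Here's a rundown of some of the most important features added in Apache Spark 3.4 and 3.5 in 2023: Spark Connect, Arrow-optimized Python UDFs, Python user-defined table functions (UDTFs), new Spark SQL features, Python arbitrary stateful processing, TorchDistributor, the new testing API, and the English SDK.
In the following sections, we'll examine each of these and provide pointers to some additional notable improvements.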
Spark Connect debuted in Apache Spark 3.4, introducing a decoupled client-server architecture that enables remote connectivity to Spark clusters from any application running anywhere. This separation of the client and server allows modern data applications, IDEs, notebooks, and programming languages to access Spark interactively. Furthermore, the decoupled architecture improves stability, upgradability, debuggability, and observability.
Apache Spark 3.5 completed Scala support and added support for major Spark components such as Structured Streaming (SPARK-42938), ML and PyTorch (SPARK-42471), and the Pandas API on Spark (SPARK-42497).
To get started, use Databricks Connect for Spark Connect on Databricks, or use Spark Connect directly with Apache Spark.
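As a minimal sketch of connecting directly to Apache Spark, the snippet below assumes a Spark Connect server is already running and reachable. The host `localhost` and port 15002 (the default Spark Connect port) are placeholder values, and the helper names `spark_connect_url` and `connect` are hypothetical. `SparkSession.builder.remote` is the PySpark entry point for Spark Connect.

```python
def spark_connect_url(host: str = "localhost", port: int = 15002) -> str:
    """Build a Spark Connect connection string of the form sc://host:port."""
    return f"sc://{host}:{port}"


def connect(url: str):
    """Open a remote SparkSession over Spark Connect (requires PySpark 3.4+)."""
    # Imported lazily so the URL helper above works without PySpark installed.
    from pyspark.sql import SparkSession

    # .remote() tells the builder to talk to a Spark Connect server
    # instead of starting a local JVM driver.
    return SparkSession.builder.remote(url).getOrCreate()
```

With a server running, `connect(spark_connect_url()).range(5).show()` would execute the query on the remote cluster and print the result in the client.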
Arrow-optimized Python UDFs (SPARK-40307) enable substantial performance optimizations by leveraging the Arrow columnar format. For example, when chaining UDFs in the same cluster, Arrow-optimized Python UDFs execute ~1.9 times faster than pickled Python UDFs on a 32 GB dataset.
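A hedged sketch of opting in to Arrow optimization for a single UDF: it assumes Spark 3.5 or later with PyArrow installed, and the function names `add_one` and `with_arrow_udf` are made up for illustration.

```python
def add_one(x):
    """Plain Python function; None-safe so it can handle SQL NULLs."""
    return None if x is None else x + 1


def with_arrow_udf(df, column: str = "id"):
    """Return df with a 'plus_one' column computed by an Arrow-optimized UDF."""
    # Imported lazily so add_one can be exercised without PySpark installed.
    from pyspark.sql.functions import udf

    # useArrow=True asks Spark to exchange data with the Python worker
    # in Arrow format instead of pickling individual rows.
    add_one_arrow = udf(add_one, returnType="long", useArrow=True)
    return df.withColumn("plus_one", add_one_arrow(column))
```

Given an active session, `with_arrow_udf(spark.range(3)).show()` would apply the UDF to the `id` column. To enable the optimization for every Python UDF in a session instead, set the configuration `spark.sql.execution.pythonUDF.arrow.enabled` to `true`.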
In Apache Spark 3.5, we extended PySpark's UDF support with user-defined table functions (UDTFs), which return a table as output instead of a single scalar result value. Once registered, they can appear in the FROM clause of a SQL query. For example, the UDTF SquareNumbers outputs the inputs and their squared values as a table:
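A minimal sketch of such a UDTF, assuming Spark 3.5 or later: the class name follows the text, while the SQL name `square_numbers`, the output column names, and the helper `register_square_numbers` are assumptions made for illustration.

```python
class SquareNumbers:
    """UDTF handler: each call to eval() may yield any number of output rows."""

    def eval(self, start: int, end: int):
        for num in range(start, end + 1):
            # Each yielded tuple becomes one row of the output table.
            yield (num, num * num)


def register_square_numbers(spark):
    """Register the UDTF and query it from the FROM clause of a SQL statement."""
    # Imported lazily so the handler class can be used without PySpark installed.
    from pyspark.sql.functions import udtf

    # The return type declares the schema of the table the function produces.
    square_numbers = udtf(SquareNumbers, returnType="num: int, squared: int")
    spark.udtf.register("square_numbers", square_numbers)
    return spark.sql("SELECT * FROM square_numbers(1, 3)")
```

Calling `register_square_numbers(spark).show()` on an active session should display three rows, pairing 1, 2, and 3 with their squares 1, 4, and 9.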
One of the major benefits of PySpark is that Spark SQL works seamlessly with PySpark DataFrames. In 2023, Spark SQL introduced many new features that PySpark can leverage directly via spark.sql, such as GROUP BY ALL and ORDER BY ALL, general table-valued function support, INSERT BY NAME, PIVOT and MELT, ANSI compliance, and more. Here's an example of using GROUP BY ALL and ORDER BY ALL:
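The sketch below uses a small made-up dataset; the view name `people`, the column names, and the helper functions are illustrative assumptions. `GROUP BY ALL` groups by every non-aggregated expression in the select list, and `ORDER BY ALL` sorts by every select-list column in order. The pure-Python function spells out the same semantics.

```python
PEOPLE_ROWS = [("Smith", "Anna", 3), ("Smith", "Anna", 4), ("Jones", "Ben", 5)]

GROUP_BY_ALL_QUERY = """
    SELECT name, firstname, SUM(level) AS total_level
    FROM people
    GROUP BY ALL
    ORDER BY ALL
"""


def expected_rows(rows):
    """Pure-Python equivalent of the query, to spell out what ALL expands to."""
    totals = {}
    for name, firstname, level in rows:
        # GROUP BY ALL here means GROUP BY name, firstname.
        totals[(name, firstname)] = totals.get((name, firstname), 0) + level
    # ORDER BY ALL here means ORDER BY name, firstname, total_level.
    return sorted((n, f, t) for (n, f), t in totals.items())


def run_group_by_all(spark):
    """Run the query through spark.sql against a temporary view."""
    df = spark.createDataFrame(PEOPLE_ROWS, "name string, firstname string, level int")
    df.createOrReplaceTempView("people")
    return spark.sql(GROUP_BY_ALL_QUERY)
```

On an active session, `run_group_by_all(spark).collect()` should contain the same values as `expected_rows(PEOPLE_ROWS)`.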
Python arbitrary stateful operations in Structured Streaming unblock a massive number of real-time analytics and machine learning use cases in PySpark by allowing state processing across streaming query executions. The following example demonstrates arbitrary stateful processing:
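As a rough sketch under stated assumptions, a running count per key could be kept with `applyInPandasWithState`. The input is assumed to be a streaming DataFrame with a long column named `id`; the function names and schemas are hypothetical. The state transition is split into a pure function so it can be checked without a cluster.

```python
from typing import Optional, Tuple


def next_count(previous: Optional[Tuple[int]], new_rows: int) -> Tuple[int]:
    """Pure state transition: add the rows seen in this batch to the stored count."""
    return ((previous[0] if previous else 0) + new_rows,)


def count_per_key(key, pdf_iter, state):
    """Called once per key per micro-batch with that key's new rows and its state."""
    import pandas as pd  # lazy import; pandas is needed only when Spark runs this

    new_rows = sum(len(pdf) for pdf in pdf_iter)
    # state.getOption is None the first time a key is seen.
    updated = next_count(state.getOption, new_rows)
    state.update(updated)  # persisted across micro-batches
    yield pd.DataFrame({"id": [key[0]], "count": [updated[0]]})


def with_running_counts(stream_df):
    """Attach the stateful function to a streaming DataFrame grouped by 'id'."""
    from pyspark.sql.streaming.state import GroupStateTimeout

    return stream_df.groupBy("id").applyInPandasWithState(
        count_per_key,
        outputStructType="id long, count long",
        stateStructType="count long",
        outputMode="update",
        timeoutConf=GroupStateTimeout.NoTimeout,
    )
```

The returned DataFrame is still a streaming DataFrame, so it would need a `writeStream` sink in update mode before any counts are produced.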
TorchDistributor provides native support in PySpark for PyTorch, which enables distributed training of deep learning models on Spark clusters. It starts the PyTorch processes and ensures that they are coordinated, leaving the distribution mechanisms themselves to PyTorch.
TorchDistributor is simple to use, with a few main settings to consider:
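A minimal sketch, assuming PySpark 3.4 or later with PyTorch installed: `num_processes`, `local_mode`, and `use_gpu` are constructor arguments of `TorchDistributor`, while the `train` function here is a hypothetical placeholder for a real training loop.

```python
def train(learning_rate: float) -> str:
    """Hypothetical training entry point.

    A real function would build a model, initialise torch.distributed,
    and run the training loop; this stub only echoes its argument.
    """
    return f"trained with lr={learning_rate}"


def run_distributed(num_processes: int = 2):
    """Launch train() in several coordinated PyTorch processes."""
    # Imported lazily so train() can be exercised without PySpark or PyTorch.
    from pyspark.ml.torch.distributor import TorchDistributor

    distributor = TorchDistributor(
        num_processes=num_processes,  # how many training processes to start
        local_mode=True,              # run on the driver node rather than on executors
        use_gpu=False,                # train on CPUs in this sketch
    )
    # Arguments after the function are forwarded to train().
    return distributor.run(train, 1e-3)
```

Setting `local_mode=False` would instead place the processes on the cluster's executors, and `use_gpu=True` would make them use the GPUs available there.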
The new testing API in the pyspark.testing package (SPARK-44042) brings significant enhancements for developers testing PySpark applications. It provides utility functions for equality tests, complete with detailed error messages, making it easier to identify discrepancies in DataFrame schemas and data. The example output below illustrates:
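A hedged sketch of one such equality check, assuming Spark 3.5 or later: `assertDataFrameEqual` is part of `pyspark.testing`, while the sample rows and the helper `describe_mismatch` are invented for illustration.

```python
EXPECTED_ROWS = [("Alice", 30), ("Bob", 25)]
ACTUAL_ROWS = [("Alice", 30), ("Bob", 26)]  # one deliberately wrong value
SCHEMA = "name string, age int"


def describe_mismatch(spark):
    """Return the assertion message if the DataFrames differ, else None."""
    # Imported lazily so the sample data can be inspected without PySpark installed.
    from pyspark.testing import assertDataFrameEqual

    actual = spark.createDataFrame(ACTUAL_ROWS, SCHEMA)
    expected = spark.createDataFrame(EXPECTED_ROWS, SCHEMA)
    try:
        # Compares schemas and data; row order is ignored by default.
        assertDataFrameEqual(actual, expected)
    except AssertionError as err:
        return str(err)
    return None
```

In a test suite the assertion would normally be called directly, without the `try`/`except`, so that the detailed message appears in the test runner's failure report. `assertSchemaEqual` from the same package compares schemas only.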
The English SDK for Apache Spark simplifies Spark usage by taking instructions written in plain English and converting them into PySpark and Spark SQL code. This makes PySpark programming more accessible, especially for code related to DataFrame transformation operations, data ingestion, and UDFs, and its caching further boosts productivity. The English SDK has great potential to streamline development processes, minimize code complexity, and expand the Spark community's reach. Try it out yourself!
Here are some of the other features introduced in Apache Spark 3.4 and 3.5 that you might want to explore if you aren't familiar with them already:
In 2023, vibrant innovation from the open-source community significantly enriched both PySpark and Apache Spark, broadening the toolkits available for data professionals and streamlining analytics workflows. With Apache Spark 4.0 on the horizon, PySpark is poised to further revolutionize data processing through new features and enhanced performance, reaffirming its commitment to advancing data analytics within the data engineering and data science community.
This post provided a quick overview of the most significant improvements made in Apache Spark 3.4 and 3.5 in 2023 to enhance the ease of use, performance, and flexibility of PySpark. All of these features are available in Databricks Runtime 13 and 14—why not try some of them out for yourself today?