PySpark in 2023: A Year in Review

With the releases of Apache Spark 3.4 and 3.5 in 2023, we focused heavily on improving PySpark performance, flexibility, and ease of use...
Engineering blog

Parameterized queries with PySpark

PySpark has always provided wonderful SQL and Python APIs for querying data. As of Databricks Runtime 12.1 and Apache Spark 3.4, parameterized queries...
Engineering blog

Python Dependency Management in Spark Connect

November 14, 2023 by Hyukjin Kwon and Ruifeng Zheng in Engineering Blog
Managing the environment of an application in a distributed computing environment can be challenging. Ensuring that all nodes have the necessary environment to...
Engineering blog

Arrow-optimized Python UDFs in Apache Spark™ 3.5

In Apache Spark™, Python User-Defined Functions (UDFs) are among the most popular features. They empower users to craft custom code tailored to their...
Engineering blog

Introducing Apache Spark™ 3.5

Today, we are happy to announce the availability of Apache Spark™ 3.5 on Databricks as part of Databricks Runtime 14.0. We extend our...
Engineering blog

Spark Connect Available in Apache Spark 3.4

Last year Spark Connect was introduced at the Data and AI Summit. As part of the recently released Apache Spark™ 3.4, Spark Connect...
Engineering blog

Introducing Apache Spark™ 3.4 for Databricks Runtime 13.0

Today, we are happy to announce the availability of Apache Spark™ 3.4 on Databricks as part of Databricks Runtime 13.0. We extend...
Engineering blog

Python Arbitrary Stateful Processing in Structured Streaming

October 18, 2022 by Hyukjin Kwon and Jungtaek Lim in Engineering Blog
More and more customers are using Databricks for their real-time analytics and machine learning workloads to meet the ever-increasing demand of their...
Engineering blog

How to Profile PySpark

In Apache Spark™, declarative Python APIs are supported for big data workloads. They are powerful enough to handle most common use cases. Furthermore...
Engineering blog

Introducing Apache Spark™ 3.3 for Databricks Runtime 11.0

Today we are happy to announce the availability of Apache Spark™ 3.3 on Databricks as part of Databricks Runtime 11.0. We want...
Engineering blog

How to Monitor Streaming Queries in PySpark

Streaming is one of the most important data processing techniques for ingestion and analysis. It provides users and developers with low latency and...
Engineering blog

Introducing Apache Spark™ 3.2

We are excited to announce the availability of Apache Spark™ 3.2 on Databricks as part of Databricks Runtime 10.0. We want to...
Engineering blog

Pandas API on Upcoming Apache Spark™ 3.2

October 4, 2021 by Hyukjin Kwon and Xinrong Meng in Engineering Blog
We're thrilled to announce that the pandas API will be part of the upcoming Apache Spark™ 3.2 release. pandas is a powerful, flexible...
Engineering blog
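The pandas API on Spark described above is designed as a drop-in: existing pandas code scales out by swapping the import. A minimal sketch with illustrative data (shown here with plain pandas so it runs anywhere):

```python
import pandas as pd
# On Apache Spark 3.2+, the same code runs distributed by swapping the import:
#   import pyspark.pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "SF"], "sales": [10, 20, 5]})

# Familiar pandas operations keep their semantics under either import.
totals = df.groupby("city")["sales"].sum()
print(totals["NY"])  # 30
```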

Benchmark: Koalas (PySpark) and Dask

Koalas is a data science library that implements the pandas APIs on top of Apache Spark so data scientists can use their favorite...
Engineering blog

Introducing Apache Spark™ 3.1

We are excited to announce the availability of Apache Spark 3.1 on Databricks as part of Databricks Runtime 8.0. We want to...
Engineering blog

How to Manage Python Dependencies in PySpark

December 22, 2020 by Hyukjin Kwon in Engineering Blog
Controlling the environment of an application is often challenging in a distributed computing environment: it is difficult to ensure all nodes have...
Engineering blog
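Among the approaches the post above covers, one common pattern is packing a conda environment and shipping it to every node with `--archives`; roughly (archive name, alias, and script name are illustrative):

```shell
# Pack the current conda environment (requires the conda-pack tool).
conda pack -f -o pyspark_conda_env.tar.gz

# The archive is unpacked on each node under the alias after '#';
# point the executors' Python at the interpreter inside it.
export PYSPARK_DRIVER_PYTHON=python           # driver uses the local env
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
```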

Python Autocomplete Improvements for Databricks Notebooks

At Databricks, we strive to provide a world-class development experience for data scientists and engineers, and new features are constantly getting added to...
Engineering blog

An Update on Project Zen: Improving Apache Spark for Python Users

September 4, 2020 by Hyukjin Kwon and Matei Zaharia in Engineering Blog
Apache Spark™ has reached its 10th anniversary with Apache Spark 3.0 which has many significant improvements and new features including but not limited...
Engineering blog

Interoperability between Koalas and Apache Spark

Koalas is an open source project which provides a drop-in replacement for pandas, enabling efficient scaling out to hundreds of worker nodes for...
Engineering blog

A Comprehensive Look at Dates and Timestamps in Apache Spark™ 3.0

Apache Spark is a very popular tool for processing structured and unstructured data. When it comes to processing structured data, it supports many...
Company blog

Introducing Koalas 1.0

Koalas was first introduced last year to provide data scientists using pandas with a way to scale their existing big data workloads by...
Engineering blog

Vectorized R I/O in Upcoming Apache Spark 3.0

June 1, 2020 by Hyukjin Kwon in Engineering Blog
R is one of the most popular programming languages in data science, specifically dedicated to statistical analysis with a number of extensions, such...
Engineering blog

New Pandas UDFs and Python Type Hints in the Upcoming Release of Apache Spark 3.0

May 20, 2020 by Hyukjin Kwon in Engineering Blog
Pandas user-defined functions (UDFs) are one of the most significant enhancements in Apache Spark™ for data science. They bring many benefits, such...
Engineering blog
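In the Spark 3.0 style described above, a pandas UDF's type is inferred from Python type hints on an ordinary pandas function. The underlying function is plain pandas, shown here outside Spark (on a cluster you would decorate it with `pyspark.sql.functions.pandas_udf`):

```python
import pandas as pd

# Spark 3.0+ reads the hints (Series -> Series) to pick the UDF type.
# On Spark this would be wrapped as:
#   from pyspark.sql.functions import pandas_udf
#   @pandas_udf("double")
def multiply(a: pd.Series, b: pd.Series) -> pd.Series:
    # Vectorized: operates on whole pandas Series, not row by row.
    return a * b

print(multiply(pd.Series([1.0, 2.0]), pd.Series([3.0, 4.0])).tolist())  # [3.0, 8.0]
```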

10 Minutes from pandas to Koalas on Apache Spark

This is a guest community post from Haejoon Lee, a software engineer at Mobigen in South Korea and a Koalas contributor. pandas is...