Project Zen: Making Data Science Easier in PySpark

May 27, 2021 03:15 PM (PT)

The number of PySpark users has increased dramatically, and Python has become one of the most commonly used languages in data science. In order to cater to the increasing number of Python users and improve Python usability in Apache Spark, Apache Spark initiated Project Zen named after “The Zen of Python” which defines the principles of Python.

Project Zen started with newly redesigned pandas UDFs and function APIs with Python type hints in Apache Spark 3.0. The Spark community has since then, introduced numerous improvements as part of Project Zen in Apache Spark 3.1 and the upcoming apache Spark 3.2 that includes:

  • Python type hints
  • New documentation
  • Conda, venv and PEX
  • numpydoc docstring
  • pandas APIs on Spark
  • Visualization

In this talk, we will present the improvements and features in Project Zen with demonstration to show how Project Zen makes data science easier with the improved usability.

In this session watch:
Hyukjin Kwon, Software Engineer, Databricks
Haejoon Lee, Software Engineer, Databricks

 

Transcript

Hyukjin Kwon: Hello. I’m Hyukjin Kwon with Haejoon Lee. We’re both Databricks software engineers. And then, we are going to talk about the Project Zen today. Project Zen is the effort to make PySpark more Pythonic and easier to use. So this project was started early in Spark 3.1 development. And this talk will show how the Project Zen makes the data science easier with PySpark.
So I’m a software engineer at Databricks. And I’m also a member of the Apache Spark PMC and committer, and also maintain Koalas. Haejoon also works for Databricks, and then he also maintains Koalas together. We both are fairly active in Spark development.
All right. So before we start to talk about Project Zen, I will explain a bit of background first. So the first figure shows that the trend of the questions in stack overflow and then Python-related question have rapidly increased. The second is the stats of quotes in GitHub, and that the Python took the second place. Python has grown rapidly in last few years, and then it became one of the dominant program language.
Apache Spark community took that such kind of trend seriously. And then, we initiate the Project Zen to increase the Python usability in Apache Spark so it provides a better user experience to the Python users. The goal of this project is pretty simple. It’s just to make the PySpark more Pythonic and easier to use with working better with other Python libraries.
So this is the agenda of this talk today. So I’m going to briefly go through what’s done in Spark 3.1 first. So firstly, Python type hints in PySpark that allows a bunch of cool things like auto-completion and the newly designed PySpark documentation that improve the readability with a lots of useful pages. And lastly, the Python dependency management with Conda, Virtualenv, and PEX. And then, Haejoon will continue what’s done in the upcoming Spark 3.2, that is pandas APIs on Spark, and visualizations like plot.
All right. So let me start with the Python type hints in Spark 3.1 first. Python type hints are the official way of annotating types in Python. The example here shows the before and after the PySpark type hints. It has a function called greeting that takes a string and outputs a string. With the type hints, you can know what’s expected as an input and output easily.
So the type hints make the codes easier to read. This is actually not only the benefits. The Python type hints can be effectively useful in IDEs and Notebooks that the data scientists use in their daily work. Without the type hints, IDEs or Notebook cannot support the auto-completion very well. For example, without the type hints, you trigger the auto-completion for Spark context and it shows random attribute and method. It doesn’t show what APIs it has. This is not the problem in IDEs, but more because of the Python stock typing. In Spark 3.1, we edited the full support and full support of the Python types so it provides a complete and auto-completion.
So like IDEs, Notebook now has a full auto-completion support. Now in Notebooks like Databricks or Jupyter, you press tab and it shows the full list of APIs available. Beside the auto-completion, Python type hints can also automatically document input and output types in API documentation. This improves the development efficiency. The developers don’t have to take care of the documenting input and output types, and users can expect a consistent documentation for arguments and written types.
If you’re already using Spark 3.1 with IDEs, you might have noticed that your IDE detects such errors statically. It warns when the types are mismatching. Because of the doc typing, the IDEs couldn’t detect such errors before, but now PySpark has the type hints the IDE can understand. So IDE can enable such static error detection out of the box. It prevents the developers to make mistake like using wrong types for input and output.
Another thing of the cool thing in PySpark, the 3.1, is the newly designed documentation. So here’s what you see now in Spark 3.1 PySpark documentation. It has a search box, so you can search on the pages easily. And it has side bars to navigate more effectively.
One of the things that I like most is API reference. So previously, the PySpark documentation had one single page that lists all the APIs. It was very difficult to navigate one API and hard to search related APIs together. There was no proper logical grouping before. But now, it has a proper grouping with a proper structure with hierarchy.
Each module has a each group with a table and list of APIs. And the each page has its dedicated page with a sufficient example in the descriptions. Also, the new documentation has a quickstart where you can run the examples in a Live notebook out of the box. So it launches a Jupyter Notebook for you. And then, you can change the example and run one-by-one without any extra step or installation.
Lastly, we migrated the documentation style to NumPy doc style. Many other data science libraries such as pandas or [inaudible] have chosen this style. A NumPy documentation style provides not only a good readability in text format, but also is supported by many other external tools like sphinx. So the documentation can be generated automatically.
All right. Also, Spark 3.1 has the complete support of Python dependence management. So you can use Conda, Virtualenv, and PEX to ship and manage the Python dependencies for your codes. So suppose that I’m sort of newbie. So I don’t know any of Spark cluster, big data. I just heard that my company has a Spark cluster, so I just downloaded the Spark client. I copied and paste the example from the PySpark documentation, of course without reading them closely. And I ran it and it failed. And then, it said that I have to install pandas and PyArrow.
So I did. I did the pip install pandas PyArrow. I checked the versions two. And now, looks like I can define the function. So I’m going to execute and print out my DataFrame. And it said that it failed. It said that I have to have a PyArrow installed. So why does it fail? Huh? This is because we do the distributed computing. Other computers in Spark clusters should also have the same packages installed. My UDF example just now failed because it doesn’t have the packages installed it in other nodes.
Scala or Java provides several ways to manage dependencies across all nodes, like script options such as like jars, or packages, and configurations, but PySpark users often ask how to do it with Python dependencies. In Spark 3.1, we edited the complete support of Python dependence management in all cluster types. So for example, you can use Conda in your Kubernetes cluster.
To use Conda, you first create the Conda environment and install packages you want with the Python version you want. Then, after the configuring all you just tar, ship, and ran it. Under the hood, Spark cluster automatically untars and place it properly where each executed can locate. And then, the PySpark Python environment variable points that the Python executable in untarred Conda environment for each executed. Virtualenv is also almost the same. You activate the virtual environment variable, and install the packages, and then you’re done. Just tar it, and then ship it, and run it.
In case of PEX, it’s slightly different. The PEX file itself is an executable file rather than an environment like Conda or virtual environment. For example, you execute the PEX file, and then it works exactly like a proper Python interpreter. So it doesn’t require to compress or pack. You just ship it as a regular file and use it as a Python interpreter for each executions.
So all of them work in all kind of clusters for 3.1. And of course, each approach has a pros and cons. For example, the Conda and Virtualenv only work in YARN cluster with Spark 3.0 and lower. And the Virtualenv and the PEX required to have the same Python installed in the same location for all nodes. There’s a blog post that address this problem in more details, so please check out if you’re interested in. Yeah.
And then, yeah, here so far is what’s done in Spark 3.1. There are actually more cool things done with Project Zen and Spark 3.1 like PyPI Hydra installation option. So please check the Project Zen [inaudible] for more details if you’re interested in. And for the upcoming Spark 3.2, I will leave it to you, Haejoon.

Haejoon Lee: Thanks, Hyukjin. So one of the most important changes in Spark 3.2 is that now you can use pandas APIs on Spark. So this is the trend of a stack overflow question for pandas. And you can see it has radically increased. Pandas became a super famous data science tool and it has became almost a standard in this field. Many people start to learn pandas for data science.
But there is one known problem in pandas. Pandas has a limitation to deal with the large sets of data because pandas doesn’t scale out or some operation even requires more memory than their size for intermediate copy. And Apache Spark is one of the most common choices to scale their workloads out. Spark provides easy APIs for distributed computing and scale jobs out of the box.
So fixes is you have to learn all the possible APIs again to migrate your pandas codes. Many of PySpark DataFrame API were inspired by SQL APIs and they have behaviors different from pandas. For example, you should use select with areas for the rename API in pandas and group by and order by for value counts. Also, you cannot do in place update in PySpark like pandas does.
So another things is that pandas has a bunch of powerful plotting and visualization features. So when you want to browse your data, you can use plots in pandas to understand your data effectively and easily. But you should rely on tests and [inaudible] at PySpark.
So suppose I have the pandas workload like this, and this workload analyzed the number of visitors from two different websites and visualize it a total number of visitors by region. So the number of visitor are small. I can load it in pandas with no problem. So later on, the number of visitors increased and my workload start to fail because the input is too large to process in single node. So now, it fails with memory error. So I decided to convert my workload to PySpark because pandas is read_csv has a different behavior from the default PySpark csv APIs. I have to configure the options in PySpark codes.
My workload did the in-place update with the duplicating, but PySpark does not allow this, so I have to create another DataFrame in PySpark. And I also PySpark does not have the same API like values value count in pandas, so I have to mimic this behavior by using groupBy count and orderBy.
Lastly, I want to merge both counts from both websites like my workload DoS because PySpark does not allow the operations on different DataFrame out of the box. So I have to manually join on the common column and merge them. And finally, I want to draw a plot, but PySpark doesn’t have. So the data has to be collected to driver’s side, converted to pandas DataFrame, and plotted by the pandas API. So there are too many things to change and too many differences to handle. Also, converted codes are very Long in PySpark and mainly because of the behavioral differences. And also I couldn’t plot out of the box in PySpark.
So the Spark community took this seriously and raised the discussion and decided to bring the Koalas project to PySpark for pandas APIs on Spark. So the pandas APIs are implemented on Spark by PySpark APIs and maintaining the separate metadata to follow pandas [inaudible] structures. And the Spark SQL engine takes the responsibility of the execution. So the performance is close to the native Spark SQL.
With the pandas APIs on Spark, you don’t have to deal with all the differences and rewriting your codes. You just change a single line on the top and it works the same. And another cool thing in pandas APIs on Spark is that you can easily switch it to PySpark DataFrame APIs. So the existing PySpark users can also easily switch it to pandas contexts in the middle and switch back to PySpark context with almost zero overhead.
Okay. Then, let me summarize the pandas APIs on Spark. Firstly, pandas API with distributed computing, and now we can plot PySpark DataFrame easily out of the box. Also pandas APIs on Spark encapsulated immutable PySpark DataFrame and provides immutablized DataFrame layer that allows in-place updates and computing on the different DataFrames.
And lastly, you can easily switch it between pandas and PySpark context. And this was officially proposed as an SPIP in Spark community, so please read the SPIP for more details.
Okay. Then lastly, let’s talk a bit more about the visualization feature in Spark 3.2. Now, PySpark can draw a chart for your data. for example, now you can show your data in a Histogram plot that uses a machine learning library [inaudible] to compute. And also, you can draw your data with Area plot by taking samples from your data.
And also you can draw a Boxplot that uses Spark SQL expressions to compute the tests need. For all these plots in PySpark, it uses Plotly as the default plotting backend for visualization under the hood. So Plotly is interactive plotting library that allows you to manipulate your plot interactively like zoom in and zoom out. And the chart itself is pretty sophisticated and compared to the other plotting libraries.
Another things to mention is that the pandas actually uses a different plotting backend. Pandas uses matplotlib by default and PySpark uses plotly. So if you want to match the default plot behaviors with pandas, then you can set the plotting.backend option to matplotlib.
So all these plots in PySpark are implemented in three different ways. For some plots, we use PySpark libraries directly, and for others we use a sampling or taking top-N approach. So this is the case when you compute the data by PySpark APIs, plotly boxplot and histogram are computed by PySpark library such as machine learning [inaudible] or native Spark SQL expressions. And the results are collected to the driver side, and then they are plotted by the plotting backend.
And so for plots like line and area plots, we use sampling approach because the sample is good enough to show the general trend and rise. And it becomes too slow to draw a chart when there are too many data points, but there are not many differences in the output chart. And lastly, PySpark users top-N approaches for plots like pie chart because it makes less sense to include too many sections with one pie chart that makes the chart difficult to read.
All right. I’m going to go through the what we talked today. So, for Spark 3.1, we talk about the Python type hints that allows auto-completion in IDEs and Notebooks. And Spark 3.1 also has the new documentation with much more readable pages and live notebook with quickstart. And lastly, the Python dependence management is complete now.
And in case of the upcoming Spark 3.2, we introduced pandas APIs on Spark that allows many things impossible with Plane Spark… I mean, the PySpark and like in-place update that includes a visualization support that provides an interactive and pretty plots. And almost 19% of pandas plots are implemented in PySpark.

Hyukjin Kwon

Hyukjin is a Databricks software engineer, Apache Spark PMC member and committer, working on many different areas in Apache Spark such as PySpark, Spark SQL, SparkR, etc. He is also one of the top con...
Read more

Haejoon Lee

Haejoon is a software engineer at Databricks. His main interest is in Koalas and PySpark. He is one of the major contributors of the Koalas project.


もっと読む