SESSION
Pandas on Spark: Simplicity of Pandas with Efficiency of Spark
OVERVIEW
EXPERIENCE | In Person |
---|---|
TYPE | Breakout |
TRACK | Data Engineering and Streaming |
INDUSTRY | Enterprise Technology, Health and Life Sciences, Financial Services |
TECHNOLOGIES | Apache Spark |
SKILL LEVEL | Beginner |
DURATION | 40 min |
With Python as the go-to language for data science, pandas has become immensely popular in the data science community: it is simple to learn and use, yet powerful, expressive, and flexible. A key drawback of pandas is that it processes everything on a single machine, so it cannot scale as data volumes grow. Pandas API on Spark addresses this issue by letting users handle vast datasets with the familiar pandas API while Apache Spark performs scalable, distributed processing under the hood. In addition, Pandas on Spark extends pandas with access to SQL and machine learning utilities. In this talk, we will give an overview of Pandas on Spark: how to get started, and how to use it with your existing pandas code to scale your existing data science workloads.
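As a rough illustration of the "existing pandas code" idea, the sketch below runs the same group-by on either a plain pandas DataFrame or a pandas-on-Spark DataFrame. The helper names `to_scalable_frame` and `mean_by_group` and the sample data are hypothetical and not taken from the talk; the fallback to plain pandas is an assumption added only so the example also runs where PySpark or a JVM is unavailable.

```python
import pandas as pd


def to_scalable_frame(pdf):
    """Hypothetical helper: prefer pandas API on Spark, fall back to plain pandas."""
    try:
        import pyspark.pandas as ps  # pandas API on Spark (PySpark 3.2+)
        return ps.from_pandas(pdf)   # convert the local DataFrame to a distributed one
    except Exception:
        # Assumption for this sketch: PySpark or a JVM is not available,
        # so keep the single-machine pandas DataFrame.
        return pdf


def mean_by_group(df, key, value):
    """Pandas-style code that works unchanged on either DataFrame type."""
    out = df.groupby(key)[value].mean()
    # pandas-on-Spark results expose to_pandas(); plain pandas results do not.
    if hasattr(out, "to_pandas"):
        out = out.to_pandas()
    # Group order is not guaranteed on Spark, so sort for a stable result.
    return out.sort_index()


# Made-up sample data for illustration only.
pdf = pd.DataFrame({"team": ["a", "a", "b"], "score": [1.0, 3.0, 5.0]})
df = to_scalable_frame(pdf)
result = mean_by_group(df, "team", "score")
print(result)
```

In a real Spark environment the fallback would likely be dropped, and data would typically be read directly into a pandas-on-Spark DataFrame rather than converted from a local one.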
SESSION SPEAKERS
Matthew Powers
Staff Developer Advocate
Databricks
Xinrong Meng
Senior Software Engineer
Databricks