SESSION

All About Python Dependency Management in PySpark

OVERVIEW

EXPERIENCE	In Person
TYPE	Lightning Talk
TRACK	Data Engineering and Streaming
INDUSTRY	Education, Enterprise Technology
TECHNOLOGIES	Apache Spark
SKILL LEVEL	Beginner
DURATION	20 min

Local Python workloads often require a variety of Python dependencies. Managing them is harder in a distributed computing environment: every node must have the correct environment, and it can be difficult to tell where the user's code actually executes. In PySpark, there are three approaches to managing Python dependencies. First, you can statically create a packed environment using tools such as Conda. Second, Spark Connect can be used for session-level dependency management, which integrates well with package managers. Third, Apache Spark plans to introduce UDF-level dependency management, allowing you to specify dependencies for an individual UDF. This is particularly useful when your workloads depend on different versions of the same dependency or when you want to freeze the dependencies of a UDF. This talk walks through the past and present of dependency management in PySpark, discussing the challenges faced and what comes next.
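As a rough illustration of the first two approaches, here is a minimal sketch and not the speakers' own material. It assumes a Conda environment has already been packed into an archive, for example with conda-pack; the file name pyspark_conda_env.tar.gz, the alias environment, and the helper function names are all hypothetical. The Spark Connect helper assumes a PySpark version with Spark Connect artifact support (3.5 or later). UDF-level dependency management is described above as planned, so it is not shown.

```python
import os


def packed_env_settings(archive="pyspark_conda_env.tar.gz", alias="environment"):
    """Return (spark_conf, env_vars) for shipping a packed Conda environment.

    The archive name and alias are illustrative assumptions.
    """
    # Spark unpacks the archive on each executor under the name after '#'.
    conf = {"spark.archives": f"{archive}#{alias}"}
    # Executors should use the interpreter inside the unpacked archive.
    env = {"PYSPARK_PYTHON": f"./{alias}/bin/python"}
    return conf, env


def create_session(archive="pyspark_conda_env.tar.gz", alias="environment"):
    """Approach 1: a statically packed environment for a classic session."""
    from pyspark.sql import SparkSession  # imported lazily; needs pyspark

    conf, env = packed_env_settings(archive, alias)
    os.environ.update(env)
    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def attach_env_to_connect_session(
    spark, archive="pyspark_conda_env.tar.gz", alias="environment"
):
    """Approach 2: session-level dependencies on a Spark Connect session."""
    # Upload the archive so it is scoped to this session only.
    spark.addArtifacts(f"{archive}#{alias}", archive=True)
    # Point this session's Python workers at the uploaded interpreter.
    spark.conf.set("spark.sql.execution.pyspark.python", f"{alias}/bin/python")
```

Calling packed_env_settings() on its own only builds the settings and does not require PySpark; the other two helpers need a working Spark installation or a reachable Spark Connect server.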

SESSION SPEAKERS

Haejoon Lee

Takuya Ueshin