SESSION

All About Python Dependency Management in PySpark

OVERVIEW

EXPERIENCE: In Person
TYPE: Lightning Talk
TRACK: Data Engineering and Streaming
INDUSTRY: Education, Enterprise Technology
TECHNOLOGIES: Apache Spark
SKILL LEVEL: Beginner
DURATION: 20

Local Python workloads often depend on a variety of third-party packages. Managing those dependencies is harder in a distributed computing environment: every node needs the correct environment, and it is not always obvious where your code actually runs. This talk covers three approaches to managing Python dependencies in PySpark. First, you can statically create a packed environment using a tool such as Conda. Second, Spark Connect supports session-level dependency management and integrates well with package managers. Third, Apache Spark plans to introduce UDF-level dependency management, which lets you specify dependencies for an individual UDF. This is particularly useful when your workloads depend on different versions of the same dependency or when you want to freeze the dependencies of a UDF. The talk walks through the past and present of dependency management in PySpark, the challenges faced along the way, and what comes next.
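The two approaches available today can be sketched roughly as follows. This is a minimal illustration, not the speakers' own code: it assumes an archive such as pyspark_conda_env.tar.gz was already built (for example with conda-pack), and the helper names packed_env_settings, start_session_with_packed_env, and attach_env_to_connect_session are hypothetical. The configuration keys and the addArtifacts call follow the PySpark package-management documentation as best understood here; check them against your Spark version. The planned UDF-level mechanism is not shown because the abstract does not describe its API.

```python
import os


def packed_env_settings(archive_path, alias="environment"):
    """Build the settings for shipping a packed environment (hypothetical helper).

    The '#alias' suffix asks Spark to unpack the archive into a directory
    named `alias` in each executor's working directory.
    """
    conf = {"spark.archives": f"{archive_path}#{alias}"}
    env = {"PYSPARK_PYTHON": f"./{alias}/bin/python"}
    return conf, env


def start_session_with_packed_env(archive_path):
    """Static approach: the environment is fixed when the session is created."""
    # Imported lazily so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    conf, env = packed_env_settings(archive_path)
    os.environ.update(env)  # tell the workers which interpreter to launch
    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def attach_env_to_connect_session(spark, archive_path, alias="environment"):
    """Session-level approach: `spark` is assumed to be a Spark Connect session."""
    # Upload the archive for this session only, then point Python workers at it.
    spark.addArtifacts(f"{archive_path}#{alias}", archive=True)
    spark.conf.set("spark.sql.execution.pyspark.python", f"{alias}/bin/python")
```

The first function bakes the environment into the application at startup, so every job in that application shares it. The second attaches an environment to one Spark Connect session, which is what allows different sessions on the same cluster to use different dependencies.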

SESSION SPEAKERS

Haejoon Lee

Software Engineer
Databricks

Takuya Ueshin

Senior Software Engineer
Databricks