Li Jin is a software engineer at Two Sigma. Li focuses on building high-performance data analysis tools with Spark for financial data. Li is a co-creator of Flint, a time series analysis library on Spark. Previously, Li worked on building a large-scale task scheduling system. In his spare time, Li loves hiking, traveling, and winter sports.
Over the past few years, Python has become the default language for data scientists. Packages such as pandas, numpy, statsmodels, and scikit-learn have gained wide adoption and become the mainstream toolkits. At the same time, Apache Spark has become the de facto standard for processing big data. Spark ships with a Python interface, PySpark. However, because Spark's runtime is implemented on top of the JVM, using PySpark with native Python libraries sometimes results in poor performance and usability. In this talk, we introduce a new type of PySpark UDF designed to solve this problem: the Vectorized UDF. Vectorized UDFs are built on top of Apache Arrow and bring you the best of both worlds: the ability to define easy-to-use, high-performance UDFs and to scale up your analysis with Spark.
Session hashtag: #Py1SAIS