Li Jin is a software engineer at Two Sigma. Li focuses on building high-performance data analysis tools with Python and Spark for financial data. Li is a co-creator of Flint, a time series analysis library on Spark. Previously, Li worked on building a large-scale task scheduling system. In his spare time, Li loves hiking, traveling, and winter sports.
June 23, 2020 05:00 PM PT
Pandas is the de facto standard (single-node) DataFrame implementation in Python. However, as data grows larger, pandas performance degrades to the point where it is no longer practical. Spark, on the other hand, has become a very popular choice for analyzing large datasets in the past few years. There is an API gap between pandas and Spark, however, so users who switch from pandas to Spark often need to rewrite their programs. Ibis is a library designed to bridge the gap between local execution (pandas) and cluster execution (BigQuery, Impala, etc.). In this talk, we will introduce a Spark backend for Ibis and demonstrate how users can move between pandas and Spark with the same code.
October 21, 2021 02:45 PM PT
Spark 2.3.0 set a great foundation for using Apache Arrow to increase Python performance and interoperability with pandas. Come by and share your use cases to see if using Arrow could improve your Spark jobs. Discuss possible next steps for leveraging Arrow in Spark, and how it could jumpstart machine learning and deep learning workloads.
June 4, 2018 05:00 PM PT
Over the past few years, Python has become the default language for data scientists. Packages such as pandas, numpy, statsmodels, and scikit-learn have gained wide adoption and become the mainstream toolkits. At the same time, Apache Spark has become the de facto standard for processing big data. Spark ships with a Python interface, PySpark. However, because Spark's runtime is implemented on top of the JVM, using PySpark with native Python libraries sometimes results in poor performance and usability.
In this talk, we introduce a new type of PySpark UDF designed to solve this problem: the Vectorized UDF. Vectorized UDFs are built on top of Apache Arrow and bring you the best of both worlds: the ability to define easy-to-use, high-performance UDFs and to scale up your analysis with Spark.
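As a rough illustration of the idea, and not code from the talk, the sketch below contrasts a row-at-a-time function with a vectorized one that receives a whole `pandas.Series` per call. The function names are hypothetical. The Spark registration step assumes PySpark 2.3 or later with pyarrow installed, and is imported lazily so the pandas-only part runs without Spark.

```python
import pandas as pd


def plus_one_scalar(x: float) -> float:
    # Row-at-a-time style: invoked once for every value.
    return x + 1.0


def plus_one_vectorized(s: pd.Series) -> pd.Series:
    # Vectorized style: invoked once per batch, operating on a whole Series.
    return s + 1.0


def as_spark_udf():
    # Assumption: PySpark 2.3+ and pyarrow are available in the environment.
    # pandas_udf wraps the Series -> Series function as a scalar vectorized UDF.
    from pyspark.sql.functions import pandas_udf

    return pandas_udf(plus_one_vectorized, "double")


# Both styles compute the same values; only the calling pattern differs.
batch = pd.Series([1.0, 2.0, 3.0])
row_result = batch.apply(plus_one_scalar)
vec_result = plus_one_vectorized(batch)
```

With Spark available, the object returned by `as_spark_udf()` could be applied to a numeric DataFrame column, for example `df.select(as_spark_udf()(df["v"]))`, where `df` and the column `v` are placeholders.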
Session hashtag: #Py1SAIS