DataFrames have a rich API and great performance, which makes them an attractive choice for a data warehouse. Those performance wins grow even larger when the underlying data is stored in a columnar format like Parquet. On the other hand, refactoring all of a warehouse’s pipelines to use DataFrames, and migrating all of the underlying data to Parquet, is no small undertaking. Are there steps that can make adopting DataFrames easier?

Shopify provides both internal and external customers with reporting and analytics built on Spark and PySpark. Our initial implementation of this architecture used a large data warehouse of structured JSON. As time went on, we felt the weight of this decision and decided to evolve our pipeline using Parquet and DataFrames. At Shopify, we deal with billions of financial transactions and other user-generated events, amounting to petabytes of data. These petabytes of JSON have been converted to Parquet, and we’ve refactored our biggest jobs to use DataFrames. The difference in speed is incredible, but getting to this point wasn’t easy. In the process, we discovered that a DataFrame’s data types are not always 1-to-1 with Python’s data types, and that working within a DataFrame’s stricter structure is not the same as working with RDDs. We also found ways to make the process much easier, without the need for any downtime.

This talk will focus on the methods we used and the lessons we learned along the way when adopting DataFrames in our warehouse.
Sol is currently a Data Engineer at Shopify, where he writes tooling around PySpark. He recently led the effort to migrate their data warehouse to Parquet. He's passionate about data science, computer science, genetics, and cycling.
Franklyn is a Data Engineer at Shopify, where he builds tools that empower data analysts to use PySpark to solve big data analytics problems. He enjoys the challenges of scalability and is a Spark contributor.