Petastorm is a popular open-source library from Uber that enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. We are excited to announce that Petastorm 0.9.0 supports easy conversion of data from an Apache Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader. The new Spark Dataset Converter API makes it easier to do distributed model training and inference on massive data from multiple data sources. The Spark Dataset Converter API was contributed by Xiangrui Meng, Weichen Xu, and Liang Zhang (Databricks), in collaboration with Yevgeni Litvin and Travis Addair (Uber).
A key step in any deep learning pipeline is converting data to the input format of the DL framework. Apache Spark is a widely used big data framework, but converting data from Apache Spark to deep learning frameworks can be tedious. For example, to convert an Apache Spark DataFrame with a feature column and a label column to a TensorFlow Dataset, users need to either save the DataFrame on a distributed filesystem in Parquet format and load the converted data with third-party tools such as Petastorm, or save it directly as TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention when handling columns in the Spark DataFrame. This engineering friction hinders data scientists’ productivity.
Databricks contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion steps. With the new API, it takes only a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
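As a rough illustration of what those few lines can look like, here is a minimal sketch. It assumes Petastorm 0.9.0 or later, PySpark, TensorFlow 2.x, and PyTorch are installed, and that a SparkSession and a DataFrame already exist. The helper names to_file_url and preview_first_batches, the batch size, and the cache directory are hypothetical choices made for this example, not part of Petastorm.

```python
from pathlib import Path


def to_file_url(local_dir: str) -> str:
    """Turn a local directory path into a file:// URL for the cache directory."""
    return Path(local_dir).absolute().as_uri()


def preview_first_batches(spark, df, cache_dir: str = "/tmp/petastorm_cache"):
    """Convert a Spark DataFrame and return the first TensorFlow and PyTorch batch.

    The cache directory is an assumption for this sketch; on a cluster it
    would normally point at a distributed filesystem instead.
    """
    # Imported lazily so this module still loads where Petastorm is absent.
    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    # Tell the converter where to cache the DataFrame as Parquet files.
    spark.conf.set(
        SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, to_file_url(cache_dir)
    )

    # Materialize the DataFrame once; both loaders below read from that cache.
    converter = make_spark_converter(df)
    try:
        # Each context manager opens a reader and closes it on exit.
        with converter.make_tf_dataset(batch_size=32) as dataset:
            tf_batch = next(iter(dataset))
        with converter.make_torch_dataloader(batch_size=32) as dataloader:
            torch_batch = next(iter(dataloader))
    finally:
        # Remove the cached Parquet files once they are no longer needed.
        converter.delete()
    return tf_batch, torch_batch
```

In a real training job, the model's fit or training loop would run inside the with block instead of pulling a single batch, and the converter would typically be kept until training finishes.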
The Spark Dataset Converter API provides the following features:
Check out the links in the Resources section for more details.
Try out the end-to-end example notebooks linked below and in the Resources section on Databricks Runtime for Machine Learning 7.0 Beta with all the requirements installed.
AWS Notebooks
Azure Notebooks
Thanks to Petastorm authors Yevgeni Litvin and Travis Addair from Uber for the detailed reviews and discussions that enabled this feature!
Databricks documentation with end-to-end examples (AWS | Azure)
Petastorm GitHub Homepage
Petastorm SparkDatasetConverter API documentation