Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs

May 26, 2021 11:30 AM (PT)


As part of its machine learning benchmarking efforts, MLCommons has built The People’s Speech, an 86,000-hour open supervised speech recognition dataset with a commercial-use license, incorporating subtitled videos and audio in the public domain scraped from the Internet. Creating a speech recognition dataset requires running inference on a pre-trained neural network speech recognition model to “force align” audio against a transcript (in this case, subtitles). To replace an initial CPU-based pipeline that took approximately 3,500 CPU days with one that runs end-to-end in 24 hours, we created a hybrid data pipeline that uses Apache Spark for general data processing and Google Cloud Tensor Processing Units (TPUs) for running the neural network speech recognition model.


I will describe in-the-weeds learnings on how to (1) use a non-GPU accelerator with Spark for inference, (2) share physical memory fairly between the PySpark UDF process and the JVM process in the same executor, and (3) efficiently join data back to its source DataFrame after it has been reordered by batching by sequence length.
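To make point (3) concrete, here is a minimal pure-Python sketch of the general idea, not the pipeline's actual code. It assumes each utterance carries a unique key, sorts utterances by length so that each batch needs little padding, and then uses the key to put results back in source order. The names `batch_by_length`, `run_batched`, and `fake_model` are hypothetical; in Spark, the final step would correspond to a join on the key column rather than a dictionary lookup.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Item = Tuple[str, Sequence[int]]  # (unique key, sequence of samples or frames)


def batch_by_length(items: List[Item], max_batch_size: int) -> List[List[Item]]:
    """Sort items by sequence length, then cut the sorted list into batches.

    Neighbouring items have similar lengths, so padding within a batch is small.
    """
    ordered = sorted(items, key=lambda kv: len(kv[1]))
    return [ordered[i:i + max_batch_size]
            for i in range(0, len(ordered), max_batch_size)]


def run_batched(items: List[Item],
                model_fn: Callable[[List[Sequence[int]]], List[int]],
                max_batch_size: int) -> Dict[str, int]:
    """Run model_fn on length-sorted batches and index each output by its key."""
    results: Dict[str, int] = {}
    for batch in batch_by_length(items, max_batch_size):
        keys = [key for key, _ in batch]
        seqs = [seq for _, seq in batch]
        for key, output in zip(keys, model_fn(seqs)):
            results[key] = output
    return results


def fake_model(seqs: List[Sequence[int]]) -> List[int]:
    """Hypothetical stand-in for accelerator inference: zero-pad, then sum rows."""
    width = max(len(s) for s in seqs)
    padded = [list(s) + [0] * (width - len(s)) for s in seqs]
    return [sum(row) for row in padded]


source = [("utt-a", [1, 2, 3, 4, 5]), ("utt-b", [7]),
          ("utt-c", [2, 2, 2]), ("utt-d", [1, 1])]
results = run_batched(source, fake_model, max_batch_size=2)

# Restore source order by looking up each key; the batches themselves
# were processed in length order, not source order.
restored = [(key, results[key]) for key, _ in source]
```

Carrying the key through inference is what makes the reordering harmless: the model can see the data in whatever order is cheapest, and the source order is recovered afterwards by key.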


If you do offline inference on sequence data with deep learning models, this session is for you. Our entire pipeline is open source under an Apache 2 license.


In this session watch:
Daniel Galvez, Software Engineer, MLCommons


Daniel Galvez

Daniel Galvez works on MLCommons's datasets working group. Previously, he worked on machine learning and content recommendation at LinkedIn. He has a BS in computer science from Cornell University, wh...