Daniel Galvez works on MLCommons’s datasets working group. Previously, he worked on machine learning and content recommendation at LinkedIn. He has a BS in computer science from Cornell University, where he developed GPU-accelerated training of deep neural networks for automatic speech recognition. He is a maintainer of the Kaldi Automatic Speech Recognition toolkit and is the MLCommons (https://mlcommons.org/) Speech Recognition Inference benchmark owner.
May 26, 2021 11:30 AM PT
As part of its machine learning benchmarking efforts, MLCommons (mlcommons.org) has built an 86,000-hour open supervised speech recognition dataset with a commercial-use license, known as The People's Speech, incorporating subtitled videos and public-domain audio scraped from the Internet. Creating a speech recognition dataset requires running inference with a pre-trained neural network speech recognition model to "force align" audio against a transcript (in this case, subtitles). To improve on an initial CPU-based pipeline that took approximately 3,500 CPU-days, we built a hybrid data pipeline that runs end-to-end in 24 hours, using Apache Spark for general data processing and Google Cloud Tensor Processing Units (TPUs) for running the neural network speech recognition model.
I will describe in-the-weeds learnings on how to (1) use a non-GPU accelerator with Spark for inference, (2) share physical memory fairly between the PySpark UDF worker.py process and the JVM process within the same executor, and (3) implement efficient joins of data that has been reordered relative to its source DataFrame by batching by sequence length (tf.data.experimental.bucket_by_sequence_length).
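As context for learning (2): when PySpark UDFs run, each executor hosts both a JVM and one or more Python worker.py processes, and both draw from the same physical memory. Spark exposes knobs for splitting that budget; the settings below are real Spark configuration properties, but the numbers are illustrative placeholders, not the pipeline's actual values.

```shell
# Illustrative split of one executor's physical memory between the JVM
# heap and everything outside it (including PySpark worker.py processes).
# The numbers here are made up for the example.
spark-submit \
  --conf spark.executor.memory=24g \
  --conf spark.executor.memoryOverhead=32g \
  --conf spark.python.worker.memory=2g \
  my_pipeline.py
```

Here `spark.executor.memory` caps the JVM heap, `spark.executor.memoryOverhead` reserves non-JVM memory (where the Python workers live), and `spark.python.worker.memory` sets the per-worker threshold before Python aggregation spills to disk.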
If you do offline inference on sequence data with deep learning models, this session is for you. Our entire pipeline is open source under an Apache 2 license at https://github.com/mlcommons/peoples-speech.
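The third learning above can be sketched in pure Python: batching by sequence length reorders rows relative to the source DataFrame, so each row must carry an id through inference that the results can later be joined back on. Everything here (the `bucket_by_length` helper, the toy data) is invented for illustration; the real pipeline uses tf.data.experimental.bucket_by_sequence_length and Spark joins.

```python
# Minimal sketch of length-bucketed batching plus an order-restoring join.
# bucket_by_length is a hypothetical stand-in for
# tf.data.experimental.bucket_by_sequence_length.

def bucket_by_length(rows, boundaries, batch_size):
    """Group (row_id, sequence) pairs into batches of similar length.

    `boundaries` are inclusive upper length bounds per bucket; sequences
    longer than the last boundary fall into a final overflow bucket.
    """
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for row_id, seq in rows:
        # First bucket whose boundary covers this sequence's length.
        idx = next((i for i, b in enumerate(boundaries) if len(seq) <= b),
                   len(boundaries))
        buckets[idx].append((row_id, seq))
    for bucket in buckets:
        for i in range(0, len(bucket), batch_size):
            yield bucket[i:i + batch_size]

# Toy "utterances": (row_id, token sequence). Bucketing reorders them.
rows = [(0, [1] * 3), (1, [1] * 9), (2, [1] * 2), (3, [1] * 8)]
batches = list(bucket_by_length(rows, boundaries=[4], batch_size=2))

# "Inference" results arrive in bucketed order; the carried row_id lets
# us join them back to the source rows, like a keyed join in Spark.
results = {row_id: len(seq) for batch in batches for row_id, seq in batch}
restored = [results[row_id] for row_id, _ in rows]  # original row order
```

The key point is the explicit `row_id`: once batching has shuffled rows across length buckets, positional alignment with the source DataFrame is gone, and only a keyed join recovers it.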