The n-dimensional array (ndarray) is a ubiquitous data structure in scientific computing, whether analyzing time-varying movies of neural activity, collections of satellite images, or sensor time series. The ndarray generalizes the two-dimensional matrix to support data structures spanning multiple dimensions, and many applications could benefit from efficient distributed implementations. While Spark’s DataFrame provides rich support for large tabular data, handling ndarrays remains a challenge. This talk will introduce Bolt, an open-source implementation of an ndarray built on PySpark. Bolt provides a familiar API enabling distributed computations across one or more array dimensions at a time. It also implements an efficient chunking scheme to minimize shuffle complexity as well as analysis of shape information to simplify error handling. I will describe the implementation and show examples of scientific use cases, as well as explain how Bolt fits into the more general scientific Python ecosystem across both local and distributed settings.
I am a computational neuroscientist in the research group of Jerermy Freeman at the Janelia Research Campus. Broadly, my work focuses on understanding the neural circuits that the brain uses to implement the myriad computations that it needs to perform in everyday life. One major component of this is analyzing the large data sets of neural activity coming from cutting-edge neuroscience experiments. Through this avenue, I am also interested in creating quality open-source software tools that the scientific community can use to gain insights from such data sets.