In recent years, increasing computational power has enabled larger scientific experiments with high computing demands, such as brain tissue simulations. Larger simulations generally produce larger amounts of data, which neuroscientists then need to analyze. Currently, neuroscientists analyze simulation reports with Python scripts, thanks to the language's simplicity and the performance of the NumPy library.
However, this analysis workflow will become infeasible in the near future, as we foresee a 10x increase in dataset size in the next year. Therefore, we are exploring how to accelerate the analysis of brain activity simulations with big data technologies such as Spark. In this talk, we will present how we address this challenge: from building RDDs/DataFrames from custom binary files to the data queries and transformations needed to achieve the desired scientific analyses. To reach our goals, we have implemented our workflow in five different ways, combining RDDs, DataFrames, different data structures and representations, and different data partitioning schemes.
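As a rough illustration of the first step mentioned above, turning custom binary data into rows that a query can run over, here is a minimal sketch in plain Python. The record layout (neuron id, time step, value), the names `RECORD`, `decode_records` and `mean_by_gid`, and the sample data are all assumptions made for illustration; they do not describe the actual simulation report format or the implementations discussed in the talk.

```python
import struct

# Hypothetical fixed-width record layout (an assumption, not the real report format):
# little-endian int32 neuron id, int32 time step, float32 value.
RECORD = struct.Struct("<iif")


def decode_records(buf):
    """Decode a bytes buffer of back-to-back records into (gid, step, value) tuples."""
    if len(buf) % RECORD.size != 0:
        raise ValueError("buffer length is not a multiple of the record size")
    return [RECORD.unpack_from(buf, off) for off in range(0, len(buf), RECORD.size)]


def mean_by_gid(rows):
    """Average the value per neuron id, the kind of query a group-by would express."""
    sums = {}
    for gid, _step, value in rows:
        total, count = sums.get(gid, (0.0, 0))
        sums[gid] = (total + value, count + 1)
    return {gid: total / count for gid, (total, count) in sums.items()}


# Synthetic data for illustration only.
sample = b"".join(
    RECORD.pack(gid, step, value)
    for gid, step, value in [(1, 0, -65.0), (1, 1, -63.0), (2, 0, -70.0)]
)
rows = decode_records(sample)
means = mean_by_gid(rows)
```

In a Spark setting, one possible route (again an assumption, not necessarily one of the five implementations) would be to read fixed-length records with PySpark's `SparkContext.binaryRecords(path, recordLength)`, map each record through `RECORD.unpack`, and pass the resulting tuples to `spark.createDataFrame` with an explicit schema, so that the per-neuron average becomes a `groupBy` aggregation.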
After significant engineering and programming effort, we would like to share our lessons learned with the community: how Spark features can benefit data analysis in our neuroscience research area and which decisions can negatively impact performance. We would also like to open a discussion about some critical limitations we have found when applying Spark to our use cases, and how to address them in the future as a joint community effort. In brief, as takeaway messages, we will highlight the suitability of Spark for our data analysis, how data generation can strongly affect subsequent data analysis, and how the choice of data types and formats can have a significant impact on Spark performance. We will present experiments run on Cooley, the data analysis cluster at Argonne National Laboratory (ANL).
Session hashtag: #Py5SAIS
Judit Planas received her Ph.D. in Computer Architecture from the Technical University of Catalonia (UPC, Spain) in 2015. She worked at the Barcelona Supercomputing Center from 2008 to 2015, where she completed her M.Sc. and Ph.D. on programming models for heterogeneous architectures. Since 2015, she has been a Postdoctoral Researcher at the Blue Brain Project, Ecole Polytechnique Federale de Lausanne (EPFL, Switzerland). Her work focuses on memory-intensive applications and big data solutions applied to neuroscience. She has published her work in international conferences and journals and has been invited to participate in various events as a speaker, panelist, or program committee member.