Rethinking Data-Intensive Science Using Scalable Analytics Systems - Databricks

Authors: Frank Austin Nothaft, Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson

Abstract

“Next generation” data acquisition technologies are allowing scientists to collect exponentially more data at a lower cost. These trends are broadly impacting many scientific fields, including genomics, astronomy, and neuroscience. We can attack the problem caused by exponential data growth by applying horizontally scalable techniques from current analytics systems to accelerate scientific processing pipelines. In this paper, we describe ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28× speedup over current genomics pipelines, while reducing cost by 63%. From building this system, we were able to distill a set of techniques for implementing scientific analyses efficiently using commodity “big data” systems. To demonstrate the generality of our architecture, we then implement a scalable astronomy image processing system which achieves a 2.8–8.9× improvement over the state-of-the-art MPI-based system.
