ADAM: Genomics Formats and Processing Patterns for Cloud Scale Computing

Authors: Matt Massie, Frank Nothaft, Christopher Hartl, Christos Kozanitis, André Schumacher, Anthony D. Joseph, David A. Patterson

Abstract

Current genomics data formats and processing pipelines are not designed to scale well to large datasets. The current Sequence/Binary Alignment/Map (SAM/BAM) formats were intended for single-node processing. Attempts to adapt BAM to distributed computing environments have shown limited scalability beyond eight nodes. Additionally, because these formats lack an explicit data schema, there are well-known incompatibilities between libraries that implement SAM/BAM/Variant Call Format (VCF) data access. To address these problems, we introduce ADAM, a set of formats, APIs, and processing stage implementations for genomic data. ADAM is fully open source under the Apache 2 license, and is implemented on top of Avro and Parquet for data storage. Our reference pipeline is implemented on top of Spark, a high-performance in-memory map-reduce system. This combination provides the following advantages: 1) Avro provides explicit data schema access in C/C++/C#, Java/Scala, Python, PHP, and Ruby; 2) Parquet allows access by database systems like Impala and Shark; and 3) Spark improves performance through in-memory caching and reduced disk I/O.
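
To illustrate what an explicit data schema buys, the following is a minimal sketch in Python. It assumes a hypothetical Avro-style record definition: the record name `AlignedRead` and its field names are invented for illustration and are not taken from ADAM's actual schema. The `conforms` helper is likewise a hypothetical stand-in for the validation an Avro library would perform; it uses only the standard library.

```python
import json

# Hypothetical Avro-style record schema. The record and field names are
# illustrative assumptions, not ADAM's actual schema.
READ_SCHEMA_JSON = """
{
  "type": "record",
  "name": "AlignedRead",
  "fields": [
    {"name": "readName",       "type": "string"},
    {"name": "contigName",     "type": "string"},
    {"name": "start",          "type": "long"},
    {"name": "mappingQuality", "type": "int"},
    {"name": "sequence",       "type": "string"}
  ]
}
"""

# Map the Avro primitive type names used above to Python types.
PRIMITIVES = {"string": str, "int": int, "long": int}


def conforms(record, schema):
    """Return True if the record has exactly the schema's fields
    and every value matches the declared primitive type."""
    fields = {f["name"]: f["type"] for f in schema["fields"]}
    if set(record) != set(fields):
        return False
    return all(isinstance(record[name], PRIMITIVES[t])
               for name, t in fields.items())


schema = json.loads(READ_SCHEMA_JSON)

good = {"readName": "read1", "contigName": "chr1", "start": 10000,
        "mappingQuality": 60, "sequence": "ACGT"}
# Same record, but with the position stored as text, the kind of
# ambiguity a schema-less text format cannot rule out.
bad = dict(good, start="10000")

print(conforms(good, schema))
print(conforms(bad, schema))
```

Because the schema is plain JSON, a reader in any language can load the same definition and apply the same check, which is the property the abstract attributes to Avro.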
