Spark meets Genomics: Helping Fight the Big C with the Big D

For the first 50 years of computer science research, we didn’t need to talk to people in other fields because the problems were so obvious: hardware was slow, programming was difficult, and software was buggy. After 50 years of progress, some researchers are looking beyond core CS, and I now believe there are more opportunities between disciplines than within them. As one of the founders of the AMP Lab, which is building software tools to analyze Big Data using cloud computing and machine learning, I looked beyond CS for applications whose data was both large and compelling. The health experts I talked to all argued that genetics is a critical application for science in general and personalized medicine in particular. Thus, we decided to target genetics.

Using cloud computing, Spark, and open source development, we believe we can dramatically improve genetics processing. Today, it takes more than a week to process a genome; we believe we can do it at higher accuracy in one hour. Here are examples to bolster our beliefs:

● Our first genetics project was a short-read sequence aligner, because alignment is one of the most expensive data processing steps. The result is SNAP (Scalable Nucleotide Alignment Program), which is among the most accurate aligners and by far the fastest, at 3 to 10 times faster than other aligners. SNAP’s success bolstered our confidence, so we started several more efforts:

● ADAM is a cluster-friendly storage format for genetic information that embraces modern systems technology to accelerate other steps of the genomic processing software pipeline. For example, ADAM executes two of the most expensive steps 110 times faster using an 82-node cluster.

● Another expensive step in a genomics pipeline is variant calling: identifying the differences between the standard human reference and each person. Papers proposing new variant callers typically use unique data sets and metrics, as genetics benchmarks do not exist. Thus, we proposed SMaSH, a variant calling benchmark suite with appropriate evaluation metrics. Because there is no real ground truth for genetics (the technology cannot yet specify 3 billion base pairs perfectly), benchmarking is trickier than in CS. Just as CS fields accelerate when benchmarks are embraced, we hope that SMaSH will accelerate variant calling.
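
As a rough illustration of the kind of metric a variant calling benchmark might report, the sketch below compares a caller's output against a validated set of variants and computes precision and recall. The Variant tuple, the precision_recall function, and the toy data are hypothetical and are not taken from SMaSH itself.

```python
from typing import NamedTuple, Set, Tuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record: a position plus the base change."""
    chrom: str
    pos: int
    ref: str
    alt: str


def precision_recall(called: Set[Variant],
                     validated: Set[Variant]) -> Tuple[float, float]:
    """Score a call set against a validated set using exact matching.

    Assumption: a call is correct only if chromosome, position, reference
    base, and alternate base all match a validated variant.
    """
    true_positives = len(called & validated)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(validated) if validated else 0.0
    return precision, recall


# Toy data, invented for illustration only.
validated = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "C", "T"),
    Variant("chr2", 75, "G", "A"),
    Variant("chr2", 300, "T", "C"),
}
called = {
    Variant("chr1", 100, "A", "G"),   # matches a validated variant
    Variant("chr1", 250, "C", "T"),   # matches a validated variant
    Variant("chr2", 80, "G", "A"),    # wrong position: false positive
}

precision, recall = precision_recall(called, validated)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here two of the three calls are correct and two of the four validated variants are found. Because a validated set is itself imperfect, scores like these are best read as estimates rather than exact error rates.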

About David Patterson

David Patterson joined UC Berkeley in 1977 after receiving all his degrees from UCLA. His most successful projects have likely been Reduced Instruction Set Computers (RISC), Redundant Arrays of Inexpensive Disks (RAID), and Network of Workstations (NOW). All three projects helped lead to multibillion-dollar industries. This research led to many papers and six books, the best known being Computer Architecture: A Quantitative Approach, co-authored with John Hennessy, and the most recent being Engineering Software as a Service, co-authored with Armando Fox. His current research is centered on cancer genomics for UC Berkeley’s AMP and ASPIRE Labs. In the past, he served as Director of the Parallel Computing Lab (Par Lab), Director of the Reliable And Distributed Systems Lab (RAD Lab), Chair of UC Berkeley’s CS Division, Chair of the Computing Research Association (CRA), and President of the Association for Computing Machinery (ACM). This work resulted in 35 honors, some shared with friends. His research awards include election to the National Academy of Engineering, the National Academy of Sciences, and the Silicon Valley Engineering Hall of Fame, along with being named Fellow of ACM, the Computer History Museum, IEEE, and both AAAS organizations. His teaching honors include the ACM Karlstrom Outstanding Educator Award, the IEEE Mulligan Education Medal, and the UC Berkeley Distinguished Teaching Award. He also received Distinguished Service Awards from ACM, CRA, and SIGARCH.