For the first 50 years of computer science research, we didn’t need to talk to people in other fields because the problems were so obvious: hardware was slow, programming was difficult, and software was buggy. After 50 years of progress, some are now looking beyond core CS, and I believe there are now more opportunities between disciplines than within them. As one of the founders of the AMP Lab, which is building software tools to analyze Big Data by utilizing cloud computing and machine learning, I looked beyond CS for applications that used data that was both large and compelling. The health experts I talked to all argued that genetics is a critical application for science in general and personalized medicine in particular. Thus, we decided to target genetics.
Using cloud computing, Spark, and open-source development, we believe we can dramatically improve genetics processing. Today, it takes more than a week to process a genome; we believe we can do it with higher accuracy in one hour. Here are examples to bolster our beliefs:
● Our first genetics project was an aligner for short-read sequencing; alignment is one of the most expensive data processing steps. The result is SNAP (Scalable Nucleotide Alignment Program), which is one of the most accurate aligners and by far the fastest, at 3 to 10 times faster. SNAP’s success bolstered our confidence, so we started several more efforts:
● ADAM is a cluster-friendly storage format for genetic information that embraces modern systems technology to accelerate other steps of the genomic processing software pipeline. For example, ADAM executes two of the most expensive steps 110 times faster using an 82-node cluster.
● Another expensive step in a genomics pipeline is variant calling: identifying the differences between the standard human reference and each person. Papers proposing new variant callers typically use unique data sets and metrics, as genetics benchmarks do not exist. Thus, we proposed SMaSH, a variant calling benchmark suite with appropriate evaluation metrics. Building such a benchmark is trickier than in CS because there is no real ground truth for genetics: the technology cannot yet specify 3B base pairs perfectly. Just as CS fields accelerate when benchmarks are embraced, we hope that SMaSH will accelerate variant calling.
David Patterson is the Pardee Professor of Computer Science, Emeritus at the University of California at Berkeley, which he joined after graduating from UCLA in 1976. Dave's research style is to identify critical questions for the IT industry and gather inter-disciplinary groups of faculty and graduate students to answer them. His best-known projects were Reduced Instruction Set Computers (RISC), Redundant Array of Inexpensive Disks (RAID), and Networks of Workstations (NOW), each of which helped lead to billion-dollar industries. A measure of the success of these projects is the list of awards won by Patterson and his teammates: the ACM A.M. Turing Award, the C & C Prize, the IEEE von Neumann Medal, the IEEE Johnson Storage Award, the SIGMOD Test of Time award, the ACM-IEEE Eckert-Mauchly Award, and the Katayanagi Prize. He was also elected to both AAAS societies, the National Academy of Engineering, the National Academy of Sciences, the Silicon Valley Engineering Hall of Fame, and to be a Fellow of the Computer History Museum. The full list includes about 40 awards for research, teaching, and service. In his spare time he coauthored seven books, including two with John Hennessy, who is past President of Stanford University and with whom he shared the Turing Award. Patterson also served as Chair of the Computer Science Division at UC Berkeley, Chair of the Computing Research Association, and President of ACM.