High-Throughput Genomics at Your Fingertips with Apache Spark

In the past two decades, novel genome sequencing technologies have revolutionized biomedical and agronomical research. In 2001, the human genome sequence was presented as the result of ten years of work by a worldwide consortium of universities and companies, with an associated price tag of three billion dollars. Today, laboratories around the world routinely sequence (“genotype”) hundreds of human, animal and plant genomes within several days for a few thousand dollars each. Upcoming developments in DNA sequencing promise to bring these costs down even further and to increase the throughput of data production. At the same time, automated digital image capture and processing technologies allow accurate recording of disease progression and of agronomically important traits, such as yield, that these individuals display (“phenotyping”). This wealth of data allows scientists to identify correlations between genotypes and phenotypes, and thereby to identify causal DNA variants in a population that play a role in the observed disease or trait. In plant breeding in particular, combining these data with sensory data from greenhouses and with consumer and business data presents an opportunity for a data-driven approach to crop improvement. The ever-increasing amount and diversity of these genetic, trait and sensory data, together with the need for rapid diagnostics and iterative refinement of genotype-by-phenotype analyses, demand scalable compute solutions that keep pace with the growth of data production. In this presentation I will show how KeyGene leverages Apache Spark to perform Genome-Wide Association Studies (GWAS) on thousands of sequenced plant genomes. I will explain how we use state-of-the-art public software and Scala code developed in-house to analyze genome sequence data, identify DNA variants and correlate these variants with phenotypes.
I will highlight the current state of the art in genomics on Apache Spark and conclude by identifying areas of interest for further development.

About Erwin Datema

Erwin Datema is a Bioinformatics Scientist at KeyGene N.V., the Netherlands. He obtained his PhD in Bioinformatics at Wageningen University, where he was involved in the assembly and annotation of the tomato and potato genomes during the advent of Next Generation Sequencing (NGS) technologies. During the last five years at KeyGene he has been involved in streamlining NGS data analysis, technology development for Sequence-Based Genotyping, comparative genomics and pan-genome analysis. Erwin has a keen interest in novel sequencing technologies and the promise they hold to accelerate molecular plant breeding, in particular in combination with high-performance computing.

About Roeland van Ham

Roeland van Ham received his PhD in Molecular Evolutionary Biology in 1994. Between 1995 and 2002, he held postdoctoral positions in genomic adaptation in Spain and the UK. In 2002, he joined Wageningen UR, where he was appointed group leader Bioinformatics and part-time associate professor Bioinformatics. In 2011, he was appointed Vice President Bioinformatics and Modeling at KeyGene N.V., an agro-biotech company where he leads an R&D department in the development of computational applications for accelerated crop improvement. Since 2015, he has combined his work at KeyGene with an appointment as professor in Plant Computational Biology at the Technical University Delft.