A Spark Framework For < $100, < 1 Hour, Accurate Personalized DNA Analysis At Scale


DNA-based analysis is finding a growing number of applications, from fast cancer diagnostics to bacterial culture identification in the food industry. Broad adoption of these techniques in practice is hampered by the immense computational complexity of the algorithms used, combined with the large size of the data sets the analysis requires. Currently, analysis times exceed a day and costs range from $200 to $600 per analysis run. We will present the architecture of a scalable multi-user Spark cluster with bare-metal and accelerator (e.g., GPU or FPGA) resources allocated by a distributed scheduler. We will showcase the use of GATK, one of the most widely used cancer diagnostics pipelines in the industry, on top of Spark. We will present a number of optimization techniques, such as data segmentation and dynamic load balancing, to increase efficiency. A performance improvement of 13x can be achieved on a 4-node cluster compared to a single-node GATK pipeline run, leading to analysis times of