Whole-genome-based metagenomic analyses hold the key to discovering novel species in microbial communities, revealing their full metabolic potential, and understanding their interactions with one another. Metagenomics projects based on next-generation sequencing typically produce 100 GB to 1,000 GB of unstructured data. Unlike many other big data problems, the analysis of metagenomics data often generates temporary files 100 to 1,000 times the size of the original data, posing a significant challenge to both hardware infrastructure and software algorithms. Here we report our experience evaluating Apache Spark for metagenomics data analysis in terms of speed, scalability, robustness, and, most importantly, ease of programming. We developed a Spark-based scalable metagenomics application to deconvolute individual genomes from a complex microbial community containing thousands of species. We then systematically tested its performance on synthetic and real-world datasets using the Elastic MapReduce framework provided by Amazon Web Services. Our preliminary results suggest that Spark provides a cost-effective solution with rapid development/deployment cycles for metagenomics data analysis. This experience likely extends to other big genomics data analyses, in both research and production settings.
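The abstract does not describe the application's internals, but deconvoluting genomes from a community typically starts from sequence features such as k-mer profiles, and counting k-mers is exactly the kind of data-parallel operation Spark expresses as `flatMap(...).map(kmer => (kmer, 1)).reduceByKey(+)`. The sketch below is a hedged, plain-Python illustration of that map/reduce pattern (not the authors' actual code); the function names and the `k=4` parameter are illustrative assumptions.

```python
# Illustrative sketch only: mirrors Spark's flatMap/reduceByKey k-mer
# counting pattern in plain Python; not the application described above.
from collections import Counter

def kmers(read, k=4):
    # "flatMap" step: emit every overlapping k-mer of one read
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def count_kmers(reads, k=4):
    # "reduceByKey" step: aggregate k-mer counts across all reads
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

reads = ["ACGTACGT", "CGTACGTA"]   # toy reads; real inputs are FASTQ files
profile = count_kmers(reads, k=4)
```

In Spark the same logic would run in parallel across partitions of reads, with the shuffle performed by `reduceByKey`; the per-genome k-mer profiles can then feed a clustering or binning step.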
Dr. Zhong Wang is a career computational biologist and group leader for genome analysis at Lawrence Berkeley National Lab. His research interests include transcriptomics, metagenomics, and high-performance computing. Dr. Wang has published over 30 high-quality papers, including several in Science and Nature.