Marco Capuccini - Databricks

Marco Capuccini

Data Scientist, Uppsala University

Marco Capuccini is a data scientist and bioinformatician. He started his career as a software engineer, working for IBM and Sopra Steria in Europe. After completing his undergraduate studies in computer science and bioinformatics, he started a PhD at Uppsala University (Sweden), where he is currently enrolled. Marco is developing methods to run scientific applications that are traditionally run on HPC clusters on cloud resources. He uses Spark as the main tool to enable large-scale data processing in his research.


EasyMapReduce: Leverage the Power of Spark and Docker to Scale Scientific Tools in MapReduce Fashion

High-throughput methods in various scientific fields have produced massive datasets in the past decade, and Big Data frameworks such as Apache Spark are a natural choice for large-scale analysis. In scientific applications, many tools are highly optimized to model, or detect, some phenomenon that occurs in a certain system, and research groups cannot sustain the effort of reimplementing such tools in Spark. Application containers are gaining tremendous momentum, as they make it possible to wrap whole software stacks that can then be easily fired up and torn down on demand, in a matter of seconds. Docker has emerged as the most broadly used containerization tool, which makes it an ideal candidate for wrapping scientific application stacks. At Uppsala University (Sweden) we developed EasyMapReduce, a Spark-based utility that runs Docker containers in MapReduce fashion in order to process large-scale distributed datasets. In this talk we will present the challenges that scientists face when running scientific tools over large datasets, and how EasyMapReduce helped us rapidly implement many use cases in our research group. In addition, we will discuss challenges and future plans for the EasyMapReduce implementation.

Keywords: Spark, Docker, HDFS, Scientific Workflows, Bioinformatics

Takeaways: Learn how serial software can be run in parallel, over a distributed dataset, using Spark and Docker

GitHub:
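
To make the idea of running a serial tool "in MapReduce fashion" concrete, here is a minimal sketch in plain Python. It is not the EasyMapReduce API: local threads stand in for Spark executors, a small Python one-liner stands in for a containerized tool, and all function names (`partition`, `run_tool`, `map_reduce`, `total_length`) are hypothetical. It only illustrates the pattern of splitting a dataset, piping each partition through an unmodified command-line tool, and combining the partial results.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(records, n):
    """Split records into n interleaved chunks, mimicking dataset partitions."""
    return [records[i::n] for i in range(n)]


def run_tool(command, records):
    """Feed one partition to a serial command on stdin; return its stdout lines."""
    result = subprocess.run(
        command,
        input="\n".join(records),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def map_reduce(records, map_command, combine, n_partitions=4):
    """Run map_command on each non-empty partition in parallel, then fold with combine."""
    parts = [p for p in partition(records, n_partitions) if p]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        mapped = list(pool.map(lambda p: run_tool(map_command, p), parts))
    return reduce(combine, mapped, [])


# Stand-in for a containerized tool: prints the length of each input line.
# Assumption: in a real setup this could be a container invocation such as
# ["docker", "run", "-i", "--rm", "<image>", "<tool>"] (hypothetical image and tool).
COUNT_CHARS = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    print(len(line.strip()))",
]


def total_length(sequences):
    """Sum the lengths of all sequences via per-partition map and a summing reduce."""
    folded = map_reduce(
        sequences,
        COUNT_CHARS,
        combine=lambda a, b: [str(sum(map(int, a)) + sum(map(int, b)))],
    )
    return sum(map(int, folded))
```

For example, `total_length(["ACGT", "AC", "G"])` pipes each partition through the stand-in tool and adds up the per-line counts. The tool itself never needs to know that it is being run in parallel, which is the property that makes wrapping existing scientific software attractive.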