Share and analyse genomic data at scale with Spark, Adam, Tachyon and the Spark Notebook - Databricks

Genomics and health data are among today's most active research areas, and they demand heavy computation, machine learning in particular. Scaling that computation helps science with strong societal impact achieve better outcomes. This is where Apache Spark and ADAM, a genomics library built on top of it, come in. This talk has two parts. First, we’ll show how Apache Spark, MLlib and ADAM can be combined to extract information from large and wide genomics datasets. Everything will be illustrated with examples from the Spark Notebook, showing how bioscientists can work interactively with such a system. Second, we’ll explain how these methodologies, and even the datasets themselves, can be shared at very large scale between remote entities such as hospitals or laboratories, using microservices built on Apache Spark, ADAM, Play Framework 2, Avro and Tachyon.

About Andy Petrella

Andy is a mathematician turned distributed computing engineer with an entrepreneurial streak. He is a certified Scala/Spark trainer and wrote the book Learning Play! Framework 2. He has participated in many projects built on top of Spark, Cassandra and other distributed technologies, in fields including geospatial, IoT, automotive and smart cities. He is the creator of the spark-notebook, one of the top projects on GitHub related to Apache Spark and Scala. He also co-founded, with Xavier Tordoir, Data Fellas, a company dedicated to data science and distributed computing.

About Xavier Tordoir

After completing a Ph.D. in experimental atomic physics, Xavier focused on the data processing side of the work, with projects in finance, genomics and software development for academic research. During that time, he worked on time series and on the prediction of biological molecular structures and interactions, and applied machine learning methodologies. He developed solutions to manage and process data distributed across data centers. Since leaving academia a couple of years ago, he has provided services and developed products related to data exploitation in distributed computing environments, embracing functional programming, Scala and big data technologies.