Our research group is investigating how to leverage Apache Spark (batch, streaming & real-time) to analyse current and future data sets in astronomy. Among the future large experiments, the Large Synoptic Survey Telescope (LSST) will start soon collecting terabytes of data per observation night, and the efficient processing and analysis of both real-time and historical data remains a major challenge. In this talk we will expose the main challenges and explore the latest developments tailored for big data problems in astronomy. On the one hand we designed a new Data Source API extension to natively manipulate telescope images and astronomical tables within Apache Spark. We then extended the functionalities of the Apache Spark SQL module to ease the manipulation of 3D data sets and perform efficient queries: partitioning, data sets join and cross-match, nearest neighbors search, spatial queries, and more. On the other hand we are using the new possibilities offered by Structured Streaming APIs in recent Apache Spark versions to enable real-time decisions by rapidly accessing and analysing the alerts sent by telescopes every night. Given the unprecedented precision of next generation of telescopes, the streams of alerts will be made of millions of alerts per night, and relying on Structured Streaming is a guarantee of not missing the latest Black Hole event in a sea of data! We will also share active learning developments used on top to improve real-time event selection and classification for the LSST telescope. You will walk away with an understanding of modern challenges in astronomy, appreciate some beautiful night skies, and how Apache Spark can help pushing further the frontiers of Science!
Julien Peloton is a research software engineer working at CNRS/University Paris-Saclay. For many years, he analyzed the data from a telescope observing the oldest accessible light in the universe (Cosmic Microwave Background). He now spends most of his time working with Apache Spark and Apache Kafka, sharing R&D efforts between groups, developing computing infrastructure for researchers, improving interoperability between industry and research in open source projects, and helping research communities to take advantage of the big data ecosystem tools.