Analysis Andromeda Galaxy Data Using Spark - Databricks

The Andromeda Galaxy, or M31, is a spiral galaxy approximately 2.5 million light years away from the Milky Way. As the nearest large external galaxy, it allows us to study galaxy features that are not visible in the Milky Way because of our position inside it. Recent studies have shown that the disc and halo of the Andromeda Galaxy extend further than previously thought. Rafiei Ravandi et al. (2016) extended previous surveys of Andromeda at mid-infrared wavelengths to produce a catalog containing 426,529 objects.

We have used the Apache Spark API for Python to cross-correlate these objects with previous astronomical catalogs, such as SIMBAD, NED, and MAST (over 11 million objects). The aim is to determine whether the objects from the new survey all belong to M31 or whether some are background or foreground objects. The Spark-Python code makes full use of Spark RDDs to join multiple catalogs into a single table, which helps us predict whether a particular object is in fact part of Andromeda. We used key-value pairs to reduce the duplicate data from the MAST catalog, and with groupByKey we can classify a particular astronomical object using the previous catalogs. We conclude that the new tool helps us to better understand the multiple astronomical catalogs for the Andromeda Galaxy, for example the differences in resolution between catalogs and the regions of the galaxy where astronomical objects (such as X-ray binaries or black holes) dwell.
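The key-value approach described above can be illustrated with a minimal sketch. It is written in plain Python rather than Spark so that it runs anywhere, and it is not the code from the talk. The comments name the PySpark RDD operations (reduceByKey, groupByKey) that each step stands in for. The match key (a sky position rounded to a coarse grid), the function names, and the sample records are all assumptions made for illustration; the abstract does not say how the catalogs were actually keyed.

```python
from collections import defaultdict


def sky_key(ra_deg, dec_deg, ndigits=3):
    """Hypothetical match key (an assumption): position rounded to a grid cell."""
    return (round(ra_deg, ndigits), round(dec_deg, ndigits))


def dedupe_by_key(pairs):
    """Keep the first value seen per key.

    Stands in for rdd.reduceByKey(lambda a, b: a) on an RDD of (key, value) pairs.
    """
    first_seen = {}
    for key, value in pairs:
        first_seen.setdefault(key, value)
    return first_seen


def group_by_key(pairs):
    """Collect every value that shares a key, as rdd.groupByKey() would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def classify(survey, reference):
    """Map each survey object to the labels earlier catalogs give its key."""
    labels = group_by_key(reference)
    return {name: sorted(set(labels.get(key, []))) for key, name in survey}


# Made-up sample records, not real catalog entries.
survey = [
    (sky_key(10.6847, 41.2690), "obj-1"),
    (sky_key(11.2001, 41.9004), "obj-2"),
]
mast_raw = [
    (sky_key(10.68472, 41.26901), "MAST: star"),
    (sky_key(10.68468, 41.26899), "MAST: star (duplicate entry)"),
]
simbad = [(sky_key(10.6847, 41.2690), "SIMBAD: globular cluster")]

mast = list(dedupe_by_key(mast_raw).items())
result = classify(survey, mast + simbad)
```

Here `obj-1` picks up one MAST label and one SIMBAD label, while `obj-2` has no match in the earlier catalogs and would be a candidate for closer inspection. Rounding positions is a crude stand-in for a proper angular-separation match, since two nearby objects can fall into different grid cells.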

About Jose Nandez

Jose received his PhD in Computational Astrophysics from the University of Alberta, Canada. He discovered his passion for data science and HPC during his PhD, when he analysed his 3D computational simulation data to study the dynamical evolution of two stars. After this work, Jose joined SHARCNET, an HPC consortium based at the University of Western Ontario, where his main interest is to