HEP Data Processing with Apache Spark

DEEP-EST is a European project building a new generation of the Modular Supercomputer Architecture (MSA). The MSA is a blueprint for heterogeneous HPC systems that support high-performance computing and data analytics workloads with high efficiency and scalability.

The field of High Energy Physics (HEP) is about to enter the exascale regime. The unprecedented amounts of data collected and simulated at the Large Hadron Collider (LHC) require new approaches to physics data processing and analysis. In our work, we explore the possibility of using Apache Spark for physics data analysis. We first discuss yet another Data Source API extension (spark-root, https://github.com/diana-hep/spark-root), which lets Spark ingest HEP data directly, so that data stored in the specialized format used in high energy physics can be processed as is.
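
As a rough illustration of what reading through a Data Source API extension looks like from PySpark, here is a minimal sketch. It assumes the input files are ROOT files, that the spark-root package is already on the Spark classpath, and that the data source is registered under the name "org.dianahep.sparkroot" (as recalled from the project's README; check the repository for the name matching your version). The helper name load_root is hypothetical and not part of spark-root.

```python
# Data source name assumed from the spark-root README; verify it for your version.
SPARK_ROOT_FORMAT = "org.dianahep.sparkroot"


def load_root(spark, path):
    """Return a DataFrame backed directly by the ROOT files at `path`.

    `spark` is an existing SparkSession started with the spark-root
    package available; `path` is any location Spark can read from.
    No intermediate conversion of the files to another format is made here.
    """
    # Standard DataFrameReader calls: choose the data source, then load.
    return spark.read.format(SPARK_ROOT_FORMAT).load(path)
```

With a running session, something like load_root(spark, "events.root").printSchema() (the file name is a placeholder) would print the schema Spark derives from the file, after which the usual DataFrame operations such as select, filter and groupBy apply.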

Given that hundreds of petabytes (PB) of recorded physics collisions are stored on disk and tape, avoiding data format conversion is key for this project. We will also share our experience of using Apache Spark for scientific computations, together with several examples of physics analysis pipelines used in searches for novel physics phenomena.

Session hashtag: #SAISEco8

About Viktor Khristenko

Software Engineer at CERN since September 2017. PhD in High Energy Physics. Interested in high-performance computing and data analytics.