Luca is a data engineer at CERN with the Hadoop, Spark, streaming and database services. Luca has 20 years of experience with architecting, deploying and supporting enterprise-level database and data services with a special interest in methods and tools for performance troubleshooting. Luca is working in developing and supporting solutions for data analytics and ML for the CERN community, including LHC experiments, the accelerator sector and CERN IT. He enjoys taking part and sharing knowledge with the open source, science, and industry data community at large.
Performance troubleshooting of distributed data processing systems is a complex task. Apache Spark comes to rescue with a large set of metrics and instrumentation that you can use to understand and improve the performance of your Spark-based applications. You will learn about the available metric-based instrumentation in Apache Spark: executor task metrics and the Dropwizard-based metrics system. The talk will cover how Hadoop and Spark service at CERN is using Apache Spark metrics for troubleshooting performance and measuring production workloads. Notably, the talk will cover how to deploy a performance dashboard for Spark workloads and will cover the use of sparkMeasure, a tool based on the Spark Listener interface. The speaker will discuss the lessons learned so far and what improvements you can expect in this area in Apache Spark 3.0.
You will learn how CERN has implemented an Apache Spark-based data pipeline to support deep learning research work in High Energy Physics (HEP). HEP is a data-intensive domain. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Filtering is applied before storing data for later processing.
Improvements in the accuracy of the online event filtering system are key to optimize usage and cost of compute and storage resources. A novel prototype of event filtering system based on a classifier trained using deep neural networks has recently been proposed. This presentation covers how we implemented the data pipeline to train the neural network classifier using solutions from the Apache Spark and Big Data ecosystem, integrated with tools, software, and platforms familiar to scientists and data engineers at CERN. Data preparation and feature engineering make use of PySpark, Spark SQL and Python code run via Jupyter notebooks.
We will discuss key integrations and libraries that make Apache Spark able to ingest data stored using HEP data format (ROOT) and the integration with CERN storage and compute systems. You will learn about the neural network models used, defined using the Keras API, and how the models have been trained in a distributed fashion on Spark clusters using BigDL and Analytics Zoo. We will discuss the implementation and results of the distributed training, as well as the lessons learned.
This talk is about sharing experience and lessons learned on setting up and running the Apache Spark service inside the database group at CERN. It covers the many aspects of this change with examples taken from use cases and projects at the CERN Hadoop, Spark, streaming and database services. The talks is aimed at developers, DBAs, service managers and members of the Spark community who are using and/or investigating "Big Data" solutions deployed alongside relational database processing systems. The talk highlights key aspects of Apache Spark that have fuelled its rapid adoption for CERN use cases and for the data processing community at large, including the fact that it provides easy to use APIs that unify, under one large umbrella, many different types of data processing workloads from ETL, to SQL reporting to ML. Spark can also easily integrate a large variety of data sources, from file-based formats to relational databases and more. Notably, Spark can easily scale up data pipelines and workloads from laptops to large clusters of commodity hardware or on the cloud. The talk also addresses some key points about the adoption process and learning curve around Apache Spark and the related "Big Data" tools for a community of developers and DBAs at CERN with a background in relational database operations. Session hashtag: #SAISDev11