Luca Canali

Data Engineer, CERN

Luca is a data engineer at CERN with the Hadoop, Spark, streaming, and database services. Luca has 20+ years of experience designing, deploying, and supporting enterprise-level database and data services, with a special interest in methods and tools for performance troubleshooting. Luca is active in developing and supporting platforms for data analytics and ML for the CERN community, including the LHC experiments, the accelerator sector, and CERN IT. He enjoys sharing experience and knowledge with data communities in science and industry at large.

Past sessions

This talk will cover some practical aspects of Apache Spark monitoring, focusing on measuring Apache Spark running in cloud environments and aiming to empower Apache Spark users with data-driven performance troubleshooting. Apache Spark metrics expose important information about Spark's internal execution. In addition, Apache Spark 3 has introduced an improved plugin interface that extends metrics collection to third-party APIs. This is particularly useful when running Apache Spark in cloud environments, as it allows measuring OS and container metrics such as CPU usage, I/O, memory usage, and network throughput, as well as metrics related to cloud filesystem access. Participants will learn how to use this type of instrumentation to build and run an Apache Spark performance dashboard, which complements the existing Spark Web UI for advanced monitoring and performance troubleshooting.
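As an illustration of the kind of setup discussed in the talk, the sketch below shows one way to enable a Dropwizard metrics sink and register a metrics plugin when building a SparkSession from PySpark. The Graphite endpoint is a placeholder, and the plugin class name is only an example of a third-party plugin class, not part of Spark itself.

# Illustrative sketch only: enabling the Dropwizard Graphite sink and a
# third-party metrics plugin when creating a SparkSession. The Graphite host
# and the plugin class name below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-monitoring-demo")
    # Send Spark metrics to a Graphite-compatible endpoint, e.g. to feed a
    # Grafana-based performance dashboard (GraphiteSink ships with Spark).
    .config("spark.metrics.conf.*.sink.graphite.class",
            "org.apache.spark.metrics.sink.GraphiteSink")
    .config("spark.metrics.conf.*.sink.graphite.host", "graphite.example.org")
    .config("spark.metrics.conf.*.sink.graphite.port", "2003")
    .config("spark.metrics.conf.*.sink.graphite.period", "10")
    .config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
    # Spark 3.x plugin interface: a comma-separated list of classes implementing
    # org.apache.spark.api.plugin.SparkPlugin; the class below is an example of
    # an externally packaged plugin, not part of Spark.
    .config("spark.plugins", "ch.cern.CgroupMetrics")
    .getOrCreate()
)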

In this session watch:
Luca Canali, Data Engineer, CERN


Summit Europe 2020 What is New with Apache Spark Performance Monitoring in Spark 3.0

November 17, 2020 04:00 PM PT

Apache Spark and its ecosystem provide many instrumentation points, metrics, and monitoring tools that you can use to improve the performance of your jobs and understand how your Spark workloads are utilizing the available system resources. Spark 3.0 comes with several important additions and improvements to the monitoring system. This talk will cover the new features, review some readily available solutions to use them, and will provide examples and feedback from production usage at the CERN Spark service. Topics covered will include Spark executor metrics for fine-grained memory monitoring and extensions to the Spark monitoring system using Spark 3.0 Plugins. Plugins allow us to deploy custom metrics extending the Spark monitoring system to measure, among other things, I/O metrics for cloud file systems like S3, OS metrics, and custom metrics provided by external libraries.
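As a hedged illustration of the executor memory metrics mentioned above, the snippet below polls the monitoring REST API of a running Spark 3.x application and prints a couple of the per-executor peak memory figures; the endpoint URL is a placeholder and the exact metric names should be checked against your Spark version.

# Illustrative sketch only: querying peak executor memory metrics from the
# Spark monitoring REST API of a running application (Spark 3.x).
# The URL is a placeholder and the metric names may vary across Spark versions.
import requests

ui = "http://localhost:4040"  # driver Web UI / REST endpoint of the running app
app_id = requests.get(f"{ui}/api/v1/applications").json()[0]["id"]

for executor in requests.get(f"{ui}/api/v1/applications/{app_id}/executors").json():
    # peakMemoryMetrics (introduced in Spark 3.0) exposes fine-grained memory
    # figures such as JVM heap/off-heap and execution/storage memory peaks.
    peaks = executor.get("peakMemoryMetrics", {})
    print(executor["id"],
          peaks.get("JVMHeapMemory"),
          peaks.get("OnHeapExecutionMemory"))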

Speaker: Luca Canali

Summit Europe 2019 Performance Troubleshooting Using Apache Spark Metrics

October 16, 2019 05:00 PM PT

Performance troubleshooting of distributed data processing systems is a complex task. Apache Spark comes to the rescue with a large set of metrics and instrumentation that you can use to understand and improve the performance of your Spark-based applications. You will learn about the available metric-based instrumentation in Apache Spark: executor task metrics and the Dropwizard-based metrics system. The talk will cover how the Hadoop and Spark service at CERN uses Apache Spark metrics for troubleshooting performance and measuring production workloads. Notably, the talk will show how to deploy a performance dashboard for Spark workloads and will introduce sparkMeasure, a tool based on the Spark Listener interface. The speaker will discuss the lessons learned so far and the improvements you can expect in this area in Apache Spark 3.0.
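For readers who want to try sparkMeasure, a minimal PySpark sketch is shown below. It assumes the sparkmeasure Python package is installed and the matching ch.cern.sparkmeasure jar is available to Spark (for example via --packages); the measured query is just a placeholder workload.

# Minimal sketch of measuring a Spark workload with sparkMeasure (assumes the
# sparkmeasure Python package is installed and the corresponding jar is on the
# Spark classpath, e.g. via --packages ch.cern.sparkmeasure:spark-measure_2.12:<version>).
from sparkmeasure import StageMetrics

stage_metrics = StageMetrics(spark)   # 'spark' is an existing SparkSession

stage_metrics.begin()
spark.sql("select count(*) from range(1000) cross join range(1000)").show()
stage_metrics.end()

# Print aggregated stage-level task metrics (elapsed time, CPU time, shuffle, GC, ...)
stage_metrics.print_report()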

You will learn how CERN has implemented an Apache Spark-based data pipeline to support deep learning research work in High Energy Physics (HEP). HEP is a data-intensive domain. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Filtering is applied before storing data for later processing.

Improvements in the accuracy of the online event filtering system are key to optimizing the usage and cost of compute and storage resources. A novel prototype of an event filtering system, based on a classifier trained using deep neural networks, has recently been proposed. This presentation covers how we implemented the data pipeline to train the neural network classifier using solutions from the Apache Spark and Big Data ecosystem, integrated with tools, software, and platforms familiar to scientists and data engineers at CERN. Data preparation and feature engineering make use of PySpark, Spark SQL, and Python code run via Jupyter notebooks.

We will discuss key integrations and libraries that enable Apache Spark to ingest data stored in the HEP data format (ROOT), as well as the integration with CERN storage and compute systems. You will learn about the neural network models used, defined using the Keras API, and how the models have been trained in a distributed fashion on Spark clusters using BigDL and Analytics Zoo. We will discuss the implementation and results of the distributed training, as well as the lessons learned.
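A hedged sketch of the data-ingestion side of such a pipeline is given below. It assumes the spark-root data source is available to Spark; the format identifier, file path, and column names are purely illustrative, and the distributed Keras training with BigDL/Analytics Zoo is not shown.

# Illustrative sketch only: reading HEP data in ROOT format into a Spark DataFrame
# with the spark-root data source and preparing a training set with PySpark.
# The format identifier, path, and column names are placeholders; the distributed
# Keras training with BigDL / Analytics Zoo is omitted here.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hep-pipeline-sketch").getOrCreate()

events = (spark.read
          .format("org.dianahep.sparkroot")   # spark-root data source (DIANA-HEP)
          .load("/path/to/events.root"))      # placeholder path

# Hypothetical feature engineering step using Spark SQL functions.
training_set = (events
                .withColumn("ht", F.col("jet_pt_sum"))
                .select("ht", "missing_et", "label"))

training_set.write.mode("overwrite").parquet("/path/to/training_set.parquet")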

This talk is about sharing experience and lessons learned from setting up and running the Apache Spark service inside the database group at CERN. It covers the many aspects of this change, with examples taken from use cases and projects at the CERN Hadoop, Spark, streaming, and database services. The talk is aimed at developers, DBAs, service managers, and members of the Spark community who are using and/or investigating "Big Data" solutions deployed alongside relational database processing systems. The talk highlights key aspects of Apache Spark that have fuelled its rapid adoption for CERN use cases and for the data processing community at large, including the fact that it provides easy-to-use APIs that unify, under one large umbrella, many different types of data processing workloads, from ETL to SQL reporting to ML.

Spark can also easily integrate a large variety of data sources, from file-based formats to relational databases and more. Notably, Spark can easily scale data pipelines and workloads from laptops to large clusters of commodity hardware or to the cloud. The talk also addresses some key points about the adoption process and learning curve around Apache Spark and the related "Big Data" tools for a community of developers and DBAs at CERN with a background in relational database operations.
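As a small, hedged illustration of this kind of data source integration, the snippet below reads a relational table into a Spark DataFrame over JDBC; the connection URL, credentials, and table name are placeholders.

# Illustrative sketch only: reading a relational table over JDBC into a Spark
# DataFrame. Connection details, credentials, and table name are placeholders,
# and the matching JDBC driver jar must be available on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read-sketch").getOrCreate()

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost.example.org:1521/service_name")
      .option("dbtable", "my_schema.my_table")
      .option("user", "db_user")
      .option("password", "db_password")
      .load())

df.printSchema()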

Session hashtag: #SAISDev11