This talk is about sharing experience and lessons learned on setting up and running the Apache Spark service inside the database group at CERN. It covers the many aspects of this change with examples taken from use cases and projects at the CERN Hadoop, Spark, streaming and database services. The talks is aimed at developers, DBAs, service managers and members of the Spark community who are using and/or investigating “Big Data” solutions deployed alongside relational database processing systems. The talk highlights key aspects of Apache Spark that have fuelled its rapid adoption for CERN use cases and for the data processing community at large, including the fact that it provides easy to use APIs that unify, under one large umbrella, many different types of data processing workloads from ETL, to SQL reporting to ML.
Spark can also easily integrate a large variety of data sources, from file-based formats to relational databases and more. Notably, Spark can easily scale up data pipelines and workloads from laptops to large clusters of commodity hardware or on the cloud. The talk also addresses some key points about the adoption process and learning curve around Apache Spark and the related “Big Data” tools for a community of developers and DBAs at CERN with a background in relational database operations.
Session hashtag: #SAISDev11
Luca is a data engineer at CERN with the Hadoop, Spark, streaming and database services. Luca has 18+ years of experience with architecting, deploying and supporting enterprise-level database and data services with a special interest in methods and tools for performance troubleshooting. Luca is involved in developing and supporting solutions and platforms for data analytics and machine learning for the CERN community, including LHC experiments, the accelerator sector and CERN IT and enjoys taking part and sharing knowledge with the open source data community at large.