This talk shares experience and lessons learned from setting up and running the Apache Spark service inside the database group at CERN. It covers the many aspects of this change, with examples taken from use cases and projects at the CERN Hadoop, Spark, streaming and database services. The talk is aimed at developers, DBAs, service managers and members of the Spark community who are using and/or investigating "Big Data" solutions deployed alongside relational database processing systems. The talk highlights key aspects of Apache Spark that have fuelled its rapid adoption for CERN use cases and for the data processing community at large, including the fact that it provides easy-to-use APIs that unify, under one large umbrella, many different types of data processing workloads, from ETL to SQL reporting to ML.
Spark can also easily integrate a large variety of data sources, from file-based formats to relational databases and more. Notably, Spark can easily scale data pipelines and workloads from laptops to large clusters of commodity hardware or in the cloud. The talk also addresses some key points about the adoption process and learning curve around Apache Spark and the related "Big Data" tools for a community of developers and DBAs at CERN with a background in relational database operations.
Session hashtag: #SAISDev11
Luca is an engineer and team lead at the CERN Hadoop, Spark and database services. Luca has 17+ years of experience with architecting, deploying and supporting enterprise-level database services, with a special interest in methods and tools for performance troubleshooting in the Linux environment. In his current role Luca is involved in developing and supporting Hadoop and Spark services and data analytics solutions for the CERN community, including the LHC experiments, the accelerator sector and CERN IT.