Physicists at CERN are increasingly turning to Spark to process large physics datasets in a distributed fashion, with the aim of reducing time-to-physics through increased interactivity. The physics data itself is stored in EOS, CERN’s mass storage system, and CERN’s IT department runs an on-premises private cloud based on OpenStack to provide on-demand compute resources to physicists. This presents both opportunities and challenges for the Big Data team at CERN in providing an elastic, scalable, and reliable Spark-as-a-service on OpenStack.
The talk focuses on the design choices made and the challenges faced while developing Spark-as-a-service over Kubernetes on OpenStack to simplify provisioning, automate management, and minimize the operating burden of managing Spark clusters. In addition, the service tooling simplifies submitting applications on behalf of users, mounting user-specified ConfigMaps, copying application logs to S3 buckets for troubleshooting, performance analysis, and accounting of Spark applications, and supporting stateful Spark streaming applications. We will also share results from running large-scale sustained workloads over terabytes of physics data.
Session hashtag: #SAISEco11
Prasanth Kothuri is currently a Sr. Big Data Engineer at CERN, defining and architecting the next generation of the data analytics platform based on Hadoop and Spark. For the past 3 years, he has been working with various user communities at CERN to build data analytics solutions around Apache Hadoop, Apache Spark, and Apache Kudu. Before this, he was an Oracle Database specialist for a decade, covering all areas from performance tuning and database upgrades to disaster recovery and database security.
Piotr Mrowczynski is a Big Data Software Engineer at CERN and a Master's student in Cloud Computing and Services at KTH Royal Institute of Technology in Sweden. He is currently working on large-scale Spark-as-a-Service to enable elastic big data analytics deployments for physics.