Jiao Wang is a software engineer on the Big Data Technology team at Intel who works in the area of big data analytics. She is engaged in developing and optimizing distributed deep learning frameworks on Apache Spark.
You will learn how CERN has implemented an Apache Spark-based data pipeline to support deep learning research work in High Energy Physics (HEP). HEP is a data-intensive domain. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Filtering is applied before storing data for later processing. Improvements in the accuracy of the online event filtering system are key to optimizing the usage and cost of compute and storage resources. A novel prototype of an event filtering system based on a classifier trained using deep neural networks has recently been proposed. This presentation covers how we implemented the data pipeline to train the neural network classifier using solutions from the Apache Spark and Big Data ecosystem, integrated with tools, software, and platforms familiar to scientists and data engineers at CERN. Data preparation and feature engineering make use of PySpark, Spark SQL and Python code run via Jupyter notebooks. We will discuss key integrations and libraries that enable Apache Spark to ingest data stored in the HEP data format (ROOT) and the integration with CERN storage and compute systems. You will learn about the neural network models used, defined using the Keras API, and how the models have been trained in a distributed fashion on Spark clusters using BigDL and Analytics Zoo. We will discuss the implementation and results of the distributed training, as well as the lessons learned.
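To put the quoted rates in perspective, the short calculation below derives the collision frequency and the average data volume per collision interval from the two figures in the abstract (1 PB/s and 25 ns). It is only a back-of-the-envelope sketch: it assumes decimal petabytes and that the data rate is spread evenly over collisions, and the function names are illustrative.

```python
# Back-of-the-envelope figures derived from the rates quoted in the abstract.
DATA_RATE_BYTES_PER_S = 1e15   # 1 PB/s, assuming decimal petabytes
COLLISION_INTERVAL_S = 25e-9   # one collision event every 25 ns


def collision_rate_hz(interval_s: float) -> float:
    """Number of collision events per second for a given spacing."""
    return 1.0 / interval_s


def bytes_per_collision(rate_bytes_per_s: float, interval_s: float) -> float:
    """Average bytes produced per collision, assuming an even spread."""
    return rate_bytes_per_s * interval_s


rate_hz = collision_rate_hz(COLLISION_INTERVAL_S)
per_event = bytes_per_collision(DATA_RATE_BYTES_PER_S, COLLISION_INTERVAL_S)

print(f"Collision rate: about {rate_hz / 1e6:.0f} MHz")
print(f"Average data per collision: about {per_event / 1e6:.0f} MB")
```

Under these assumptions the online systems see roughly 40 million collisions per second at about 25 MB each, which is why filtering has to happen before anything is stored.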
Volunteers around the world increasingly act as human sensors to collect millions of data points. A team from the World Bank trained deep learning models, using Apache Spark and BigDL, to confirm that photos gathered through a crowdsourced data collection pilot matched the goods for which observations were submitted. In this talk, Maurice Nsabimana, a statistician at the World Bank, will demonstrate a collaborative project to design and train large-scale deep learning models using crowdsourced images from around the world. BigDL is a distributed deep learning library designed from the ground up to run natively on Apache Spark. It enables data engineers and scientists to write deep learning applications in Scala or Python as standard Spark programs, without having to explicitly manage distributed computations. Attendees of this session will learn how to get started with BigDL, which runs in any Apache Spark environment, whether on-premises or in the cloud. Attendees will also learn how to write a deep learning application that leverages Spark to train image recognition models at scale. Session hashtag: #DL8SAIS
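The idea of training "as a standard Spark program" can be illustrated with a toy version of synchronous data-parallel gradient descent, which is, as we understand it, the general scheme that distributed training on Spark follows: each data partition computes a gradient locally, the gradients are averaged, and one shared set of weights is updated. The sketch below is plain Python and uses no BigDL or Spark API; the linear model, the data, and all function names are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def partition_gradient(w: float, b: float, part: List[Sample]) -> Tuple[float, float]:
    """Mean-squared-error gradient of the model y = w*x + b on one partition."""
    n = len(part)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in part) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in part) / n
    return grad_w, grad_b


def train(partitions: List[List[Sample]], lr: float = 0.1, steps: int = 1000) -> Tuple[float, float]:
    """Synchronous data-parallel gradient descent over equal-sized partitions."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        # "Map" step: every partition computes its gradient independently.
        grads = [partition_gradient(w, b, part) for part in partitions]
        # "Reduce" step: average the gradients. With equal-sized partitions
        # this equals the gradient over the full data set.
        grad_w = sum(g[0] for g in grads) / len(grads)
        grad_b = sum(g[1] for g in grads) / len(grads)
        # Single synchronized update of the shared weights.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 3x + 1, split into two equal partitions.
data = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(20)]
partitions = [data[:10], data[10:]]

w, b = train(partitions)
print(f"w = {w:.3f}, b = {b:.3f}")
```

In a real cluster the map step would run on the executors and the averaging would go through the framework's communication layer; the point here is only that the user-facing logic stays a simple loop over partitions.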
AI plays a central role in today’s Internet applications and emerging intelligent systems, which are driving the need for scalable, distributed big data analytics with deep learning capabilities. There is increasing demand from organizations to discover and explore data using advanced big data analytics and deep learning. In this talk, we will share how we work with our users to build deep learning-powered big data analytics applications (e.g., object detection, image recognition, and NLP) using BigDL, an open source distributed deep learning library for Apache Spark.