Cesar Delgado

Manager Siri Open Source Technology, Apple

Cesar has been involved with Big Data since 2008 and been working on Siri since the Apple acquisition. He has also worked on other teams at Apple including iTunes, iCloud, News and Maps helping with processing pipelines and architecture.

Past sessions

Summit 2019 Making Nested Columns as First Citizen in Apache Spark SQL

April 23, 2019 05:00 PM PT

Apple Siri is the world's largest virtual assistant service powering every iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod. We use large amounts of data to provide our users the best possible personalized experience. Our raw event data is cleaned and pre-joined into an unified data for our data consumers to use. To keep the rich hierarchical structure of the data, our data schemas are very deep nested structures. In this talk, we will discuss how Spark handles nested structures in Spark 2.4, and we'll show the fundamental design issues in reading nested fields which is not being well considered when Spark SQL was designed. This results in Spark SQL reading unnecessary data in many operations. Given that Siri's data is super nested and humongous, this soon becomes a bottleneck in our pipelines. Then we will talk about the various approaches we have taken to tackle this problem. By making nested columns as first citizen in Spark SQL, we can achieve dramatic performance gain. In some of our production queries, the speed-up can be 20x in wall clock time and 8x less data being read. All of our work will be open source, and some has already been merged into upstream.

Summit Europe 2018 Pitfalls of Apache Spark at Scale

October 22, 2021 02:40 PM PT

Apple Siri is the world’s largest virtual assistant service powering every iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod. A.I. and machine learning are used to personalize your experience throughout your day. The more you use, the more helpful it can be. At this scale, the information we are processing is very humongous. We use Apache Spark to get the job done.

In this talk, we will discuss the architecture of Siri data pipelines, and in particular, how Apache Spark is used to aggregate the data coming from different data centers globally into one source of truth for analytical use-cases for ML model building and productizing. We will talk about the specific techniques we use at Siri to scale, and various pitfalls we have found along the way. As part of the OSS community, we contributed back many features and bug fixes during the process; as a result, all the Spark users can get the significant run time improvement and resource savings.

Session hashtag: #SAISML1