Pitfalls of Apache Spark at Scale

Apple Siri is the world’s largest virtual assistant service, powering every iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod. A.I. and machine learning are used to personalize your experience throughout your day: the more you use Siri, the more helpful it can be. At this scale, the volume of data we process is enormous, and we use Apache Spark to get the job done.

In this talk, we will discuss the architecture of the Siri data pipelines and, in particular, how Apache Spark is used to aggregate data coming from data centers around the globe into one source of truth for analytical use cases in ML model building and productization. We will talk about the specific techniques we use at Siri to scale, and the various pitfalls we have found along the way. As part of the OSS community, we contributed many features and bug fixes back during the process; as a result, all Spark users can benefit from significant run-time improvements and resource savings.

Session hashtag: #SAISML1

About Cesar Delgado

Cesar has been involved with Big Data since 2008 and has been working on Siri since the Apple acquisition. He has also worked on other teams at Apple, including iTunes, iCloud, News, and Maps, helping with processing pipelines and architecture.

About DB Tsai

DB Tsai is an Apache Spark PMC member and committer, and an open source and big data engineer at Apple. He implemented several algorithms in Apache Spark, including linear models with Elastic-Net (L1/L2) regularization using L-BFGS/OWL-QN optimizers. Prior to joining Apple, DB worked on personalized recommendation ML algorithms at Netflix. DB was a Ph.D. candidate in Applied Physics at Stanford University. He holds a Master's degree in Electrical Engineering from Stanford.