Abdulrahman Alfozan is a member of the data warehouse engineering team at Facebook, where he works on building and scaling Apache Spark to provide distributed computing as a service. Abdul is passionate about large-scale distributed systems, and his primary focus is enabling data scientists and data engineers to be more efficient and productive when using Apache Spark. Abdul studied computer science and engineering at the Massachusetts Institute of Technology.
At Facebook, Apache Spark handles large batch workloads that at times deal with sensitive data requiring protection and isolation across all surfaces of authentication, authorization, and encryption. With jobs from multiple teams running across data centers and geo-distributed regions, Spark actors (driver, executors, shuffle service) need to communicate securely over networks spanning large geographical areas. Spark at Facebook also operates in a multi-tenant environment with strict access control policies that must be enforced to guarantee data protection and job isolation. Operating at this scale presents several scalability challenges, and we'll share our approach to solving a few of them in this talk.
More specifically, as part of this talk, we'll share how we deployed TLS encryption for Spark jobs to secure data in transit over an untrusted network, and discuss the implications and overhead of doing so. We'll also cover how tenant isolation, security, and fine-grained access control (i.e., row/column-level security) are designed and implemented, along with our work on scaling the generation and validation of signed access tokens and on job resource distribution (files, archives, and jars).
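For context, in open-source Spark, in-transit encryption is driven by a handful of settings in `spark-defaults.conf`. The fragment below is an illustrative sketch only (paths and values are placeholders; Facebook's internal deployment differs):

```
# Enable SSL/TLS for Spark's network endpoints (illustrative values)
spark.ssl.enabled              true
spark.ssl.protocol             TLSv1.2
spark.ssl.keyStore             /path/to/keystore.jks
spark.ssl.trustStore           /path/to/truststore.jks

# Authenticate Spark internal connections and encrypt RPC traffic
spark.authenticate             true
spark.network.crypto.enabled   true
```

The overhead discussed in the talk comes largely from the handshake and per-message encryption costs these settings introduce on shuffle and RPC paths.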
Script Transformation is an important and growing use case for Apache Spark at Facebook. Spark's script transforms allow users to run custom scripts and binaries directly from SQL and serve as an important means of stitching Facebook's custom business logic into existing data pipelines.
Along with Spark SQL + UDFs, a growing number of our custom pipelines leverage Spark's script transform operator to run user-provided binaries for applications such as indexing, parallel training, and inference at scale. Spawning custom processes from the Spark executors introduces new production challenges, ranging from external resource allocation and management to structured data serialization and external process monitoring.
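To make the mechanism concrete, here is a minimal sketch of what a user-provided transform script can look like. The script name, columns, and labeling logic are hypothetical; with Spark's default row-format-delimited SerDe, input rows arrive on stdin as tab-separated fields, one row per line, and output rows are written the same way to stdout:

```python
#!/usr/bin/env python3
# Hypothetical transform script, invoked from Spark SQL roughly as:
#   SELECT TRANSFORM(id, score) USING 'my_transform.py' AS (id, label)
#   FROM events
import sys

def transform_line(line: str) -> str:
    """Turn one tab-separated input row into one output row."""
    id_, score = line.rstrip("\n").split("\t")
    # Example business logic: bucket the score into a label.
    label = "high" if float(score) > 0.5 else "low"
    return f"{id_}\t{label}"

if __name__ == "__main__":
    for line in sys.stdin:
        print(transform_line(line))
```

Each executor pipes its partition of rows through a separate instance of the script, which is what makes resource enforcement and process monitoring (discussed below) necessary.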
In this session, we will talk about the improvements to Spark SQL (and the resource manager) to support running reliable and performant script transformation pipelines. This includes:
1) cgroup v2 containers for CPU, memory, and I/O enforcement,
2) Transform jail for process namespace management,
3) Support for complex types in the row-format delimited SerDe,
4) Protocol Buffers for fast and efficient structured data serialization.

Finally, we will share our results, lessons learned, and future directions (e.g., resource over-subscription for transform pipelines).
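On point 4: when serialized messages (such as Protocol Buffers payloads) are streamed over a pipe to an external process, they are commonly length-delimited so the reader can find message boundaries. The sketch below illustrates that framing pattern in isolation; the function names are hypothetical and raw bytes stand in for serialized protobuf messages:

```python
import struct
from typing import Iterator

def write_frame(payload: bytes) -> bytes:
    """Prefix a serialized message with a 4-byte big-endian length."""
    return struct.pack(">I", len(payload)) + payload

def read_frames(buf: bytes) -> Iterator[bytes]:
    """Recover the individual payloads from a length-delimited stream."""
    offset = 0
    while offset < len(buf):
        (length,) = struct.unpack_from(">I", buf, offset)
        offset += 4
        yield buf[offset:offset + length]
        offset += length
```

Compared with the newline/tab-delimited default SerDe, binary framing like this avoids escaping issues and supports nested, structured rows at lower serialization cost.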