Script transformation is an important and growing use case for Apache Spark at Facebook. Spark’s script transforms allow users to run custom scripts and binaries directly from SQL, and they serve as an important means of stitching Facebook’s custom business logic into existing data pipelines.
Along with Spark SQL and UDFs, a growing number of our custom pipelines leverage Spark’s script transform operator to run user-provided binaries for applications such as indexing, parallel training, and inference at scale. Spawning custom processes from the Spark executors introduces new challenges in production, ranging from external resource allocation and management to structured data serialization and external process monitoring.
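For readers new to the operator, here is a minimal sketch of the script side of a transform. It assumes the default delimited row format, in which each input row reaches the script on standard input as one tab-separated line and each output line is parsed back into columns. The column names (`user_id`, `name`) and the function names are hypothetical and chosen only for illustration; this is not code from the pipelines described in this session.

```python
import sys
from typing import Iterable, Iterator, TextIO


def transform_rows(lines: Iterable[str]) -> Iterator[str]:
    """Upper-case the second column of each tab-delimited input row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip rows that do not have the two expected columns
        user_id, name = fields[0], fields[1]  # hypothetical column names
        yield f"{user_id}\t{name.upper()}"


def main(stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> None:
    """Stream rows from stdin to stdout, one output line per accepted row."""
    for out in transform_rows(stdin):
        stdout.write(out + "\n")


# When saved as a standalone script, end the file with a call to main().
```

A script like this could then be invoked from SQL with a query along the lines of `SELECT TRANSFORM(user_id, name) USING 'python3 upper_name.py' AS (user_id, name_upper) FROM users`, where the table, columns, and script name are again placeholders.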
In this session, we will talk about the improvements to Spark SQL (and the resource manager) that support running reliable and performant script transformation pipelines. These include:
1) cgroup v2 containers for CPU, Memory and IO enforcement,
2) Transform jail for process namespace management,
3) Support for complex types in Row format delimited SerDe,
4) Protocol Buffers for fast and efficient structured data serialization.
Finally, we will conclude by sharing our results, lessons learned, and future directions (e.g., resource over-subscription for transform pipelines).
Abdulrahman Alfozan is a member of the data warehouse engineering team at Facebook, where he works on building and scaling Apache Spark to provide distributed computing as a service. Abdul is passionate about large-scale distributed systems, and his primary focus is enabling data scientists and data engineers to be more efficient and productive when using Apache Spark. Abdul studied computer science and engineering at the Massachusetts Institute of Technology.