Composable Data Processing with Apache Spark

As usage of Apache Spark continues to grow across the industry, a major challenge for us has been scaling our development. Too often, developers re-implement a similar set of cross-cutting concerns, sprinkled with some use-case-specific business logic, as a concrete Spark app. The consequences of this anti-pattern are significant. Cross-cutting logic is re-implemented again and again. Each isolated Spark app is responsible for its own resiliency, scalability, monitoring, and error handling. Weaving together data as it flows across these apps is highly inefficient: pipelining data through one or more of them requires multiple rounds of loading data from and saving it to disk, which increases both the overall cost and the risk of failure.

In addition, there is no consolidated error handling when chaining multiple Spark apps. In this talk we will walk through the problems that led us to SIP, an extensible plugin framework implemented to address these issues. SIP is used extensively in Adobe's Experience Platform (AEP) for data processing. The framework lets us support a number of complex use cases by composing one or more simpler data conversion and/or validation operations. SIP is hosted internally, allowing a community of engineers to plug in code and benefit from the resiliency, scaling, and monitoring already invested in the existing infrastructure. Finally, we will dive deep into SIP's detailed error reporting and how it enables us to provide a much-improved user experience to our customers.
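To make the idea of composition concrete, here is a minimal sketch in plain Python. It is not SIP's API and it does not use Spark. The names Pipeline, StepError, parse_age, and require_email are all hypothetical, chosen only for illustration. It shows small conversion and validation steps chained in a single pass over in-memory records, with failures collected into one consolidated error report instead of being handled separately by each app.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
# A step takes a record and returns a (possibly new) record,
# or raises ValueError to reject it. (Hypothetical contract.)
Step = Callable[[Record], Record]


@dataclass
class StepError:
    step: str      # name of the step that rejected the record
    row: int       # position of the record in the input
    message: str   # human-readable reason


@dataclass
class Pipeline:
    steps: List[Step] = field(default_factory=list)

    def add(self, step: Step) -> "Pipeline":
        self.steps.append(step)
        return self

    def run(self, records: List[Record]) -> Tuple[List[Record], List[StepError]]:
        good: List[Record] = []
        errors: List[StepError] = []
        for row, record in enumerate(records):
            current = dict(record)
            for step in self.steps:
                try:
                    current = step(current)
                except ValueError as exc:
                    # Record the failure once, centrally, and skip the record.
                    errors.append(StepError(step.__name__, row, str(exc)))
                    break
            else:
                good.append(current)
        return good, errors


def parse_age(record: Record) -> Record:
    """Conversion step: turn the 'age' field into an int."""
    try:
        return {**record, "age": int(record["age"])}
    except (KeyError, TypeError, ValueError):
        raise ValueError(f"age is not an integer: {record.get('age')!r}")


def require_email(record: Record) -> Record:
    """Validation step: reject records without a plausible email."""
    if "@" not in str(record.get("email", "")):
        raise ValueError("email is missing or malformed")
    return record


pipeline = Pipeline().add(parse_age).add(require_email)
good, errors = pipeline.run([
    {"age": "41", "email": "a@example.com"},
    {"age": "n/a", "email": "b@example.com"},
    {"age": "29", "email": "missing"},
])
```

Here the data is read once, passes through both steps in memory, and every rejected record ends up in a single errors list tagged with the step that rejected it. A real framework on Spark would apply the same shape to distributed datasets rather than Python lists.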


 
About Dilip Biswal

Adobe, Inc.

Dilip Biswal is a Software Architect at Adobe working on Adobe Experience Platform. He is an active Apache Spark contributor and works in the open source community. He is experienced in relational databases, distributed computing, and big data analytics. He has worked extensively on SQL engines such as Informix, Derby, and Big SQL.

About Shone Sadler

Adobe, Inc.

Shone is a glorified plumber (aka Principal Scientist) at Adobe Systems responsible for siphoning data into Adobe's Digital Marketing Cloud. Back in the day, he was a chief architect at Q-Link Systems, a leader in Business Process Management. Shone joined Adobe in 2004 as an Architect of its LiveCycle Document Platform, helping lead Adobe's initial foray into the enterprise. Shone received his master's in MIS from DePaul University in 2000 and subsequently a master's in Programming Languages from Georgia Institute of Technology.