We can think of an Apache Spark application as the unit of work in complex data workflows. Building a configurable and reusable Apache Spark application comes with its own challenges, especially for developers who are just starting out in the domain. Configuration, parametrization, and reusability of the application code can all be difficult to get right. Solving these problems allows the developer to focus on value-adding work instead of mundane tasks such as writing large amounts of configuration code, initializing the SparkSession, or even kicking off a new project. This presentation uses code samples to describe a developer's journey from the first steps into Apache Spark all the way to a simple open-source framework that helps kick off an Apache Spark project very easily, with a minimal amount of code.

The main ideas covered in this presentation are derived from the separation of concerns principle. The first idea is to make it easier to code and test new Apache Spark applications by separating the application logic from the configuration logic. The second idea is to make the applications easy to configure, providing SparkSessions out of the box, along with data readers, data writers, and application parameters that can be set up through configuration alone. The third idea is that getting a new project off the ground should be easy and straightforward. These three ideas are a good start toward building reusable and production-worthy Apache Spark applications. The resulting framework, spark-utils, is already available and ready to use as an open-source project, but even more important are the ideas and principles behind it.
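To illustrate the first idea, here is a minimal sketch of what separating application logic from configuration logic can look like. The trait and type names (`SparkJob`, `WordCountConfig`) are illustrative assumptions, not the actual spark-utils API; the point is that the job body receives a ready-made `SparkSession` and a parsed configuration object, so it contains no setup code and is easy to unit test:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.Try

// Hypothetical configuration case class; in a framework like spark-utils
// this would be populated from configuration files or command-line arguments.
case class WordCountConfig(inputPath: String, outputPath: String)

// Hypothetical contract: pure application logic, given a session and a
// parsed config, produces a result. Session creation and config parsing
// live elsewhere, behind this trait.
trait SparkJob[Config, Result] {
  def run(implicit spark: SparkSession, config: Config): Try[Result]
}

object WordCountJob extends SparkJob[WordCountConfig, Unit] {
  override def run(implicit spark: SparkSession, config: WordCountConfig): Try[Unit] =
    Try {
      import spark.implicits._
      // Plain business logic: read text, count words, write the result.
      val counts: DataFrame = spark.read
        .text(config.inputPath)
        .as[String]
        .flatMap(_.split("\\s+"))
        .groupBy("value")
        .count()
      counts.write.csv(config.outputPath)
    }
}
```

Because `run` depends only on its two inputs, a test can pass a local `SparkSession` and an in-memory configuration without touching any deployment concerns.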
Oliver Tupran is a software engineer with more than 20 years of professional experience in various fields such as aviation, telecommunications, software modelling tools, and banking. Since 2015 he has focused on building Apache Spark applications in Scala. In the open-source world, he focuses on creating tools and frameworks. He is currently working on making it easy to develop Spark Streaming applications, as well as on online analytics, online anomaly detection systems, and machine learning algorithms.