Spark applications are composed at design time and then submitted to the central (master) component, which distributes the code among several worker nodes. In many situations, however, the application is not static: developers add new processing steps, data scientists adjust the parameters of their algorithms, and so on. To update the application, it has to be resubmitted and restarted. Unfortunately, restarting a processing pipeline safely is hard: intermediate state is lost and needs to be recomputed, which is especially critical for real-time streaming components that require 24×7 availability.

In this work we investigate the possibility of updating Spark applications without restarting them. Our first and foremost assumption is that the Spark library itself should remain unchanged; instead, we provide new processing primitives, such as dynamicMap and dynamicReduce, on top of the traditional map and reduce functions, with the difference that they can be updated on the fly. We looked into two update scenarios: updating an individual parameter, and updating or replacing a function within a map and/or reduce processing step. Our experiments show that the overhead stays within 1-2% during normal operation (without triggering updates) and rises to about 10% during an update.

In the second half of the presentation we look into dynamic changes of data sources and compare two approaches. The first is based on changing the data source within the Spark pipeline itself: we build a custom data source that can combine several existing data sources (in our example, Cassandra and HDFS). In the second, we move the component outside of the Spark platform and connect to it via a custom receiver. For the first approach, the experiments showed no noticeable difference from normal Spark operation; the second approach resulted in about 10% overhead. Our experiments were conducted on a cluster of six physical servers.
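To give a flavor of the idea behind primitives like dynamicMap, here is a minimal sketch in plain Python (not the actual Spark implementation; the names `DynamicFunction` and `dynamic_map` are illustrative assumptions): the user function is held behind a mutable reference, so a new version can be installed between batches without restarting the pipeline.

```python
# Hypothetical sketch of an updatable "dynamicMap"-style primitive.
# The mapping function lives behind a mutable holder; swapping it in
# place updates the processing step on the fly, with no restart.

class DynamicFunction:
    """Holds the current version of a user function; swap() installs a new one."""

    def __init__(self, fn):
        self._fn = fn

    def swap(self, new_fn):
        # In a real distributed setting this swap would have to be
        # propagated to the workers (e.g. re-broadcast); here it is local.
        self._fn = new_fn

    def __call__(self, x):
        return self._fn(x)


def dynamic_map(data, dyn_fn):
    # Each element is processed with whichever function version is current.
    return [dyn_fn(x) for x in data]


step = DynamicFunction(lambda x: x * 2)
print(dynamic_map([1, 2, 3], step))   # [2, 4, 6]

step.swap(lambda x: x + 10)           # on-the-fly update of the step
print(dynamic_map([1, 2, 3], step))   # [11, 12, 13]
```

The point of the indirection is that the pipeline code calling `dynamic_map` never changes; only the function held inside the reference does, which is what keeps intermediate state alive across updates.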
I am a Research Scientist at TNO (www.tno.nl, Netherlands), where I am a member of the Monitoring and Control Services group. As part of my daily work, I am responsible for the infrastructure for processing large amounts of data coming from different sources. We apply our solutions across different domains to perform continuous monitoring and control of critical infrastructures: from dikes and electrical grids to the automotive sector and building structural integrity. I completed a Bachelor's degree in Computer Science at Trento University (Italy) and obtained an MSc in Computer Science, with a specialization in distributed systems and software engineering, at the University of Groningen.