Sebastian is a PhD student in the Information Systems Group at the Hasso Plattner Institute, Potsdam, Germany, where he also earned his Master’s degree in IT systems engineering. In general, he’s passionate about leading-edge data processing technology, efficient algorithms, and elegant system designs. More concretely, his main line of research focuses on distributed algorithms for discovering dependencies in data—problems of high runtime and space complexity that must be tackled with the greatest care. While interning with the Data Analytics Group at the Qatar Computing Research Institute, he also picked up the topic of integrating heterogeneous data processing platforms and played a leading role in the design and implementation of Rheem. Last but not least, he’s devoted to open source software: not only have most of his research project outcomes been open sourced (Rheem, in particular), but he also contributes to Apache projects.
We are witnessing a proliferation of big data, which has led to a zoo of data processing systems, each providing a different set of features. For example, Spark provides scalability for analytic tasks, while Java 8 Streams provides low latency. Furthermore, complex applications, such as ETL and ML, now require a mixture of platforms to perform their tasks efficiently. In such complex data analytics pipelines, the use of multiple data processing systems is driven not only by performance, but also by data diversity: datasets often natively reside in different data formats and storage engines. Unfortunately, developers are left alone with the challenging tasks of (1) choosing the right platform for their applications and (2) performing tedious and costly data migration and integration to obtain the results. In this talk, we will present Rheem, an open source, scalable cross-platform system that frees developers from these burdens. Rheem provides an abstraction layer on top of Spark (and other processing platforms) with the aim of enabling cross-platform optimization and interoperability. It automatically selects the best data processing platforms for a given task and also handles the cross-platform execution. In particular, we will discuss how Rheem allows Spark to work in tandem with other platforms to achieve higher performance. We will also show how easily a developer can write complex applications on top of Rheem that seamlessly use multiple data processing platforms according to the task at hand. Using Rheem, developers do not have to worry about integration or data migration between Spark and other platforms. Session hashtag: #SFeco15