Stephan Kessler is a developer in a research and development team at SAP in Walldorf. He works on the integration of SAP's query execution engines into the Spark ecosystem. His main goals are to further improve the speed of Spark processing and to bring new features to the SQL extension. Before joining SAP, he earned his PhD and his Diploma (M.Sc.) at the Karlsruhe Institute of Technology at the Chair of Databases and Information Systems. Before joining the Big Data community, his research interests covered privacy in databases as well as in sensor networks.
The DataFrame API of Spark SQL allows the easy integration of external sources such as SQL databases, CSV files, or Avro sources. In addition, Spark uses the computational capabilities of sources to 'push down' projections as well as filters to the data source. This prunes unnecessary data right at the source, reducing evaluation time at the Spark level. However, sources such as SQL engines can also evaluate more complex parts of the logical plan. Exploiting these capabilities promises a substantial performance boost: for example, evaluating aggregates or joins directly in the source dramatically reduces the amount of data copied. This is challenging because it requires rewriting the logical plan depending on the features used and on the partitioning of the data. In this talk we present an extension of the data source API that allows the pushdown of arbitrary parts of the logical plan. This includes a mechanism by which sources can announce which capabilities the underlying system supports. In addition, we implemented the extended data source API on top of HANA as well as on a newly developed lightweight in-memory processing engine built at SAP. We show that the extension improves the performance of Spark SQL in combination with HANA and the lightweight engine. Finally, we give insights into how the functionality can be used with arbitrary data sources.
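The pushdown idea described above can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the actual Spark data source API or the SAP extension; all class and parameter names here are hypothetical:

```python
# Conceptual sketch of pushdown: the source evaluates as much of the
# logical plan as it can (filter, projection, aggregate), so that less
# data has to be copied to the query engine. Hypothetical names throughout.

class InMemorySource:
    """A toy 'external source' holding rows as dicts."""

    def __init__(self, rows):
        self.rows = rows

    def scan(self, columns=None, predicate=None, aggregate=None):
        """Evaluate pushed-down plan fragments inside the source.

        columns   -- projection pushdown: return only these columns
        predicate -- filter pushdown: a row -> bool function
        aggregate -- aggregate pushdown: ('sum', column) computed in-source
        """
        rows = self.rows
        if predicate is not None:              # prune rows at the source
            rows = [r for r in rows if predicate(r)]
        if aggregate is not None:              # e.g. SUM evaluated in-source
            op, col = aggregate
            if op == "sum":
                return [{"sum(%s)" % col: sum(r[col] for r in rows)}]
        if columns is not None:                # prune columns at the source
            rows = [{c: r[c] for c in columns} for r in rows]
        return rows


source = InMemorySource([
    {"region": "EMEA", "revenue": 10},
    {"region": "APJ",  "revenue": 20},
    {"region": "EMEA", "revenue": 30},
])

# Without pushdown, all three full rows would be copied to the engine and
# filtered/aggregated there; with filter + aggregate pushdown, only one
# aggregated row crosses the boundary.
result = source.scan(predicate=lambda r: r["region"] == "EMEA",
                     aggregate=("sum", "revenue"))
print(result)  # [{'sum(revenue)': 40}]
```

A capability announcement, as mentioned in the abstract, would amount to the source declaring which of these parameters it honors, so the planner knows which plan fragments it may hand over.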
The challenge of computing big data for evolving digital business processes demands new approaches to processing, interpreting, and correlating disparate data sources. A variety of enterprise use cases demands a variety of computation techniques and engines (SQL, OLAP, time series, graph, document store), but a unified framework that offers seamless access to those sources is necessary for enterprise production environments. In this talk we will highlight how new technologies in SAP HANA Vora extend Apache Spark to impact everyday business operations.