Santiago Mola is a Big Data Developer at Stratio. He works on projects with Apache Spark Streaming and Spark SQL and is currently helping to build the integration of Apache Spark with SAP HANA Vora. Santiago previously worked as a researcher in the machine learning field and has contributed to open source projects for nine years.
The DataFrame API of Spark SQL allows easy integration of external sources such as SQL databases, CSV files or Avro sources. In addition, Spark uses the computational capabilities of these sources to push down projections and filters to the data source. This prunes unnecessary data right at the source, reducing evaluation time at the Spark level. However, sources such as SQL engines can also evaluate more complex parts of the logical plan. Enabling these capabilities promises a large performance boost: for example, evaluating aggregates or joins directly in the source dramatically reduces the amount of data copied into Spark. This is challenging because it requires rewriting the logical plan depending on the features used and on the partitioning of the data. In this talk we present an extension of the data source API that allows the pushdown of arbitrary elements of the logical plan. With this extension, sources can announce which capabilities the underlying system supports. We implemented the extended data source API on top of HANA as well as on a new lightweight in-memory processing engine developed at SAP. We show that the extension improves the performance of Spark SQL in combination with both HANA and the lightweight engine. Finally, we give insights into how the functionality can be used for arbitrary data sources.
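To make the idea of capability-based pushdown concrete, the following is a minimal toy sketch in plain Python. It is not Spark's data source API and not the extension presented in the talk: the names `ToySource`, `supports`, `scan` and `run_query` are hypothetical and chosen only for illustration. The sketch assumes a source that announces a set of supported operations, and a planner that evaluates in the engine whatever the source cannot handle. Partitioning, which the abstract names as a complication, is ignored here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

Row = Dict[str, object]


def aggregate(rows: List[Row], group_by: str, sum_col: str) -> List[Row]:
    """Sum `sum_col` per distinct value of `group_by`."""
    totals: Dict[object, int] = {}
    for r in rows:
        totals[r[group_by]] = totals.get(r[group_by], 0) + r[sum_col]
    return [{group_by: k, sum_col: v} for k, v in totals.items()]


@dataclass
class ToySource:
    """Hypothetical source that announces which plan nodes it can evaluate."""
    rows: List[Row]
    capabilities: Set[str] = field(default_factory=set)  # e.g. {"filter", "aggregate"}

    def supports(self, op: str) -> bool:
        return op in self.capabilities

    def scan(self,
             predicate: Optional[Callable[[Row], bool]] = None,
             group_by: Optional[str] = None,
             sum_col: Optional[str] = None) -> List[Row]:
        """Evaluate the pushed-down parts inside the source."""
        data = self.rows
        if predicate is not None:
            data = [r for r in data if predicate(r)]
        if group_by is not None:
            data = aggregate(data, group_by, sum_col)
        return data


def run_query(source: ToySource,
              predicate: Callable[[Row], bool],
              group_by: str,
              sum_col: str) -> Tuple[List[Row], int]:
    """Run filter-then-aggregate; return (result, rows copied out of the source)."""
    push_filter = source.supports("filter")
    # The aggregate sits above the filter, so it can only move into the
    # source if the filter moves there as well.
    push_agg = push_filter and source.supports("aggregate")

    fetched = source.scan(
        predicate=predicate if push_filter else None,
        group_by=group_by if push_agg else None,
        sum_col=sum_col if push_agg else None,
    )
    copied = len(fetched)

    # Evaluate in the engine whatever the source did not handle.
    result = fetched
    if not push_filter:
        result = [r for r in result if predicate(r)]
    if not push_agg:
        result = aggregate(result, group_by, sum_col)
    return sorted(result, key=lambda r: r[group_by]), copied


if __name__ == "__main__":
    rows = [
        {"region": "EU", "amount": 10},
        {"region": "EU", "amount": 5},
        {"region": "US", "amount": 7},
        {"region": "US", "amount": -1},
        {"region": "APAC", "amount": 3},
    ]
    for caps in (set(), {"filter"}, {"filter", "aggregate"}):
        result, copied = run_query(
            ToySource(rows, caps), lambda r: r["amount"] > 0, "region", "amount"
        )
        print(sorted(caps), "rows copied:", copied, "result:", result)
```

With the sample data in the script, all three configurations return the same per-region totals, while the number of rows copied out of the source drops from 5 (no pushdown) to 4 (filter only) to 3 (filter and aggregate). This mirrors, at toy scale, the reduction in copied data that the abstract attributes to pushing aggregates into the source.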