Ioana Delaney is a Senior Software Engineer at the Silicon Valley Laboratory in San Jose, California. She was part of the DB2 for Linux, UNIX, and Windows (LUW) development team until she recently joined the Spark Technology Center at IBM. She has worked in many areas of SQL and XML query compilation, including query semantics, query rewrite, query optimization, and the federated/distributed compiler.
An informational, or statistical, constraint is a constraint such as a unique, primary key, foreign key, or check constraint that can be used by Apache Spark to improve query performance. Informational constraints are not enforced by the Spark SQL engine; rather, they are used by Catalyst to optimize query processing. Informational constraints are primarily targeted at applications that load and analyze data that originated from a data warehouse. For such applications, the conditions for a given constraint are known to be true, so the constraint does not need to be enforced during data load operations. This session will cover the support for primary and foreign key (referential integrity) constraints in Spark. You'll learn about constraint specification, metastore storage, and constraint validation and maintenance. You'll also see examples of query optimizations that utilize referential integrity constraints, such as join and distinct elimination and star-schema detection. Session hashtag: #SFdev21
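To make the idea concrete, the sketch below shows how an informational key constraint might be declared and how it enables join elimination. The DDL syntax is an assumption, modeled on Hive-style informational constraints (`DISABLE NOVALIDATE RELY`, meaning not enforced by the engine but trusted by the optimizer); the table and column names are hypothetical.

```sql
-- Hypothetical DDL: keys declared DISABLE NOVALIDATE RELY, i.e. not
-- enforced during data load, but trusted ("RELY") by the optimizer.
CREATE TABLE dim_store (
  store_id   INT,
  store_name STRING,
  PRIMARY KEY (store_id) DISABLE NOVALIDATE RELY
);

CREATE TABLE fact_sales (
  sale_id  INT,
  store_id INT,
  amount   DECIMAL(10,2),
  FOREIGN KEY (store_id) REFERENCES dim_store (store_id)
    DISABLE NOVALIDATE RELY
);

-- With the referential integrity constraint trusted, every fact row
-- matches exactly one dim_store row. Since the query below references
-- no dim_store column, the join can be eliminated:
SELECT f.sale_id, f.amount
FROM fact_sales f
JOIN dim_store d ON f.store_id = d.store_id;

-- ...and rewritten by the optimizer to a simple scan:
SELECT sale_id, amount FROM fact_sales;
```

The same trusted PK/FK metadata also lets the optimizer remove redundant `DISTINCT` operations and recognize fact/dimension relationships for star-schema detection.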
When Spark applications operate on distributed data coming from disparate data sources, they often have to directly query data sources external to Spark, such as backing relational databases or data warehouses. For that, Spark provides the Data Source API, a pluggable mechanism for accessing structured data through Spark SQL. The Data Source API is tightly integrated with the Spark Optimizer, providing optimizations such as filter push down to the external data source and column pruning. While these optimizations significantly speed up Spark query execution, depending on the data source, they cover only a subset of the functionality that could be pushed down and executed at the data source. As part of our ongoing project to provide a generic data source push down API, this presentation will show our work related to join push down. An example is the star-schema join, which can be simply viewed as filters applied to the fact table. Today, the Spark Optimizer recognizes star-schema joins based on heuristics and executes star joins using efficient left-deep trees. An alternative execution proposed by this work is to push down the star join to the external data source in order to take advantage of multi-column indexes defined on the fact table, and of other star-join optimization techniques implemented by the relational data source. Session hashtag: #EUdev7
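A minimal sketch of the kind of query involved, with hypothetical table names, may help. The first statement is the star-schema query as the application writes it; the comments describe how execution differs with and without join push down.

```sql
-- Star-schema query: a fact table joined to two dimension tables,
-- where selective dimension predicates effectively filter the fact table.
SELECT d.year, p.brand, SUM(f.amount) AS total_sales
FROM fact_sales f
JOIN dim_date    d ON f.date_id    = d.date_id
JOIN dim_product p ON f.product_id = p.product_id
WHERE d.year = 2017
  AND p.category = 'Electronics'
GROUP BY d.year, p.brand;

-- Without join push down: Spark fetches the three tables from the
-- external source (applying only filter push down and column pruning)
-- and executes the star join itself as a left-deep tree.
--
-- With join push down: the data source connector ships the entire
-- star join to the external relational database, which can exploit
-- multi-column indexes on fact_sales and its own star-join techniques,
-- returning only the (much smaller) aggregated result to Spark.
```

The benefit is largest when the dimension filters are highly selective, since the external system can avoid scanning most of the fact table and only the final result crosses the wire.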