Suresh Thalamati

Advisory Software Engineer, IBM

Suresh Thalamati is an Advisory Software Engineer at the Spark Technology Center at IBM. He is an Apache Spark contributor and works in the open source community. He is also an Apache Derby committer and PMC member. He has experience in relational databases, distributed computing, and big data analytics, with a focus on Hadoop MapReduce technologies.


Informational Referential Integrity Constraints Support in Apache Spark

An informational, or statistical, constraint is a constraint such as a unique, primary key, foreign key, or check constraint that Apache Spark can use to improve query performance. Informational constraints are not enforced by the Spark SQL engine; rather, Catalyst uses them to optimize query processing. They are primarily targeted at applications that load and analyze data that originated in a data warehouse. For such applications, the conditions for a given constraint are already known to be true, so the constraint does not need to be enforced during data load operations. This session will cover the support for primary key and foreign key (referential integrity) constraints in Spark. You'll learn about constraint specification, metastore storage, and constraint validation and maintenance. You'll also see examples of query optimizations that use referential integrity constraints, such as join elimination, distinct elimination, and star schema detection. Session hashtag: #SFdev21
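
To make the idea of join elimination concrete, here is a minimal sketch in plain Python. It does not use Spark or Catalyst, and the tables, column names, and helper functions (`is_primary_key`, `is_foreign_key`, `join_then_project`, `project_only`) are hypothetical, chosen only for illustration. It shows why a trusted primary key/foreign key relationship lets an optimizer skip an inner join when the query reads columns from the child table only.

```python
# Toy data (hypothetical): orders.customer_id references customers.id.
customers = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 2, "amount": 40.0},
    {"order_id": 12, "customer_id": 1, "amount": 15.5},
]


def is_primary_key(rows, key):
    """True if the key column is non-null and unique across all rows."""
    values = [row[key] for row in rows]
    return None not in values and len(values) == len(set(values))


def is_foreign_key(child_rows, fk, parent_rows, pk):
    """True if every child value is non-null and exists in the parent key."""
    parent_keys = {row[pk] for row in parent_rows}
    return all(row[fk] in parent_keys for row in child_rows)


def join_then_project(child_rows, fk, parent_rows, pk, columns):
    """Inner join child to parent, then keep only the child columns."""
    result = []
    for child in child_rows:
        for parent in parent_rows:
            if child[fk] == parent[pk]:
                result.append({c: child[c] for c in columns})
    return result


def project_only(child_rows, columns):
    """The rewritten plan: read the child table and skip the join."""
    return [{c: row[c] for c in columns} for row in child_rows]


columns = ["order_id", "amount"]
joined = join_then_project(orders, "customer_id", customers, "id", columns)
shortcut = project_only(orders, columns)

# With a valid primary key and a non-null foreign key, each order matches
# exactly one customer, so the join neither drops nor duplicates rows.
constraints_hold = is_primary_key(customers, "id") and is_foreign_key(
    orders, "customer_id", customers, "id"
)
print(constraints_hold, joined == shortcut)
```

In this sketch the constraints are checked explicitly. With informational constraints, the engine would instead trust the declared relationship, which is why the data must already satisfy it.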