One of the most significant benefits of Databricks Delta is the ability to use Z-ordering and dynamic file pruning to reduce the amount of data retrieved from blob storage, drastically improving query times, sometimes by an order of magnitude.
However, taking advantage of this approach over petabytes of geospatial data requires specific techniques, both in how the data is generated and in how the SQL queries are designed, to ensure that dynamic file pruning is included in the query plan.
This presentation will demonstrate these optimisations on real-world data, showing the pitfalls of the current implementation, the workarounds required, and the spectacular query performance that can be achieved when it works correctly.
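As a rough illustration of the query shapes involved (table and column names here are hypothetical, not taken from the talk), Z-ordering a Delta table on a geospatial key and then filtering through a join looks something like:

```sql
-- Co-locate rows with similar geospatial keys in the same files, so
-- per-file min/max statistics become selective (hypothetical names).
OPTIMIZE events ZORDER BY (geohash);

-- Dynamic file pruning can apply when a selective filter on the
-- dimension side of a join is pushed down onto the Z-ordered column
-- of the fact table, skipping files whose statistics cannot match.
SELECT e.*
FROM events e
JOIN regions r
  ON e.geohash = r.geohash
WHERE r.region_name = 'Manchester';
```

Whether pruning actually appears in the plan depends on details such as join type and selectivity, which is part of what the talk covers.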
Speaker: Matt Slack
Matt leads data strategy at Wejo, devising innovative ways of processing petabytes of connected car data, in both streaming and batch, using technologies such as Spark, Kafka, Kafka Streams and Akka. He is a strong advocate of Spark, having led implementations of Cloudera and Databricks in his most recent roles. Matt can regularly be found leading Spark training sessions, or getting stuck into the latest performance tuning challenge.