Eunjin Song is a Senior Software Engineer in Microsoft’s Azure Data group, where she works on Azure Synapse Analytics. She currently works on Hyperspace, an indexing sub-system for Apache Spark, and previously worked on in-memory databases at SAP.
May 27, 2021 05:00 PM PT
Hyperspace is a recently open-sourced (https://github.com/microsoft/hyperspace) indexing sub-system from Microsoft. The key idea behind Hyperspace is simple: users specify the indexes they want to build. Hyperspace builds these indexes using Apache Spark and maintains their metadata in a write-ahead log stored in the data lake. At runtime, Hyperspace automatically selects the best index to use for a given query, without requiring users to rewrite their queries. Since Hyperspace was introduced, one of the most popular requests from the Spark community has been indexing support for Delta Lake. In this talk, we present our experiences in designing and implementing Hyperspace support for Delta Lake and show how it can be used to accelerate queries over Delta tables. We will cover the necessary foundations of Delta Lake's transaction log design and explain how Hyperspace provides indexing support that works seamlessly with Delta Lake's time travel queries.
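To make the idea of automatic index selection concrete, here is a minimal toy sketch in plain Python. It is not Hyperspace's code or API: the `CoveringIndex` class, the `select_index` function, and the "narrowest matching index wins" rule are all hypothetical, chosen only to illustrate how an index could be picked for a query without the user rewriting it.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class CoveringIndex:
    """Hypothetical stand-in for one index's metadata entry."""
    name: str
    indexed_columns: Sequence[str]   # columns the index data is organized by
    included_columns: Sequence[str]  # extra columns stored alongside them

    def columns(self) -> set:
        return set(self.indexed_columns) | set(self.included_columns)


def select_index(
    indexes: Iterable[CoveringIndex],
    filter_columns: Sequence[str],
    output_columns: Sequence[str],
) -> Optional[CoveringIndex]:
    """Pick an index that can answer the query on its own, or None.

    Assumed rule (illustrative only): the first indexed column must appear
    in the query's filter, and the index must contain every column the
    query touches. Among the candidates, the one with the fewest columns wins.
    """
    needed = set(filter_columns) | set(output_columns)
    candidates = [
        ix for ix in indexes
        if ix.indexed_columns
        and ix.indexed_columns[0] in filter_columns
        and needed <= ix.columns()
    ]
    return min(candidates, key=lambda ix: len(ix.columns()), default=None)


# Three made-up indexes over a made-up table.
indexes = [
    CoveringIndex("by_id", ("id",), ("name",)),
    CoveringIndex("by_id_wide", ("id",), ("name", "city", "zip")),
    CoveringIndex("by_city", ("city",), ("id",)),
]

# A query that filters on "id" and returns "name".
chosen = select_index(indexes, ["id"], ["name"])
# A query that filters on "zip": no index is organized by that column.
missing = select_index(indexes, ["zip"], ["name"])
```

In this sketch the query is described only by the columns it filters on and returns. A real optimizer rule would work on the query plan and would also need to check that the index is still valid for the version of the data being read, which is the part that matters for time travel queries.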