Rahul Potharaju

Principal Engineering Manager, Microsoft

Rahul Potharaju is a Principal Engineering Manager in Microsoft’s Azure Data group, working on Azure Synapse Analytics. He has led several open-sourcing efforts, including Hyperspace and .NET for Spark. His work is widely published at top conferences and has won awards at venues such as SIGMM and TOMM. Previously, he worked as a Researcher in the Gray Systems Lab (GSL) at Microsoft. He earned his PhD in Computer Science from Purdue University, in a joint industrial collaboration with Microsoft Research, and his Master’s degree in Computer Science from Northwestern University. He is a recipient of the Motorola Engineering Excellence award and the Purdue Diamond Award. Rahul’s work has been adopted by several business groups inside Microsoft and has won the Microsoft Trustworthy Reliability Award.

Past sessions

Summit 2021 Hyperspace for Delta Lake

May 27, 2021 05:00 PM PT

Hyperspace is a recently open-sourced (https://github.com/microsoft/hyperspace) indexing sub-system from Microsoft. The key idea behind Hyperspace is simple: Users specify the indexes they want to build. Hyperspace builds these indexes using Apache Spark, and maintains metadata in its write-ahead log that is stored in the data lake. At runtime, Hyperspace automatically selects the best index to use for a given query without requiring users to rewrite their queries. Since Hyperspace was introduced, one of the most popular asks from the Spark community was indexing support for Delta Lake. In this talk, we present our experiences in designing and implementing Hyperspace support for Delta Lake and how it can be used for accelerating queries over Delta tables. We will cover the necessary foundations behind Delta Lake's transaction log design and how Hyperspace enables indexing support that seamlessly works with the former's time travel queries.
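The idea of automatically selecting an index for a query can be illustrated with a toy model. The Python sketch below is not Hyperspace's API and not its actual optimizer rule: the `CoveringIndex` class, the `select_index` function, and the "narrowest covering index wins" heuristic are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CoveringIndex:
    """Hypothetical index description: columns it is organized by, plus extra stored columns."""
    name: str
    indexed_columns: tuple
    included_columns: tuple

    def covers(self, filter_columns: Sequence[str], projected_columns: Sequence[str]) -> bool:
        # Usable only if the query filters on the leading indexed column and
        # every column the query touches is stored in the index.
        needed = set(filter_columns) | set(projected_columns)
        available = set(self.indexed_columns) | set(self.included_columns)
        return self.indexed_columns[0] in filter_columns and needed <= available


def select_index(indexes: Sequence[CoveringIndex],
                 filter_columns: Sequence[str],
                 projected_columns: Sequence[str]) -> Optional[CoveringIndex]:
    """Return the narrowest usable index, or None if the query must scan the source data."""
    candidates = [ix for ix in indexes if ix.covers(filter_columns, projected_columns)]
    if not candidates:
        return None
    # Assumed heuristic: fewer stored columns means less data to read; break ties by name.
    return min(candidates,
               key=lambda ix: (len(ix.indexed_columns) + len(ix.included_columns), ix.name))


# Usage: the user declares indexes once; queries are matched without being rewritten.
indexes = [
    CoveringIndex("by_id", ("id",), ("name",)),
    CoveringIndex("by_id_wide", ("id",), ("name", "city")),
    CoveringIndex("by_city", ("city",), ("id",)),
]
chosen = select_index(indexes, filter_columns=["id"], projected_columns=["name"])
print(chosen.name if chosen else "no usable index")
```

In this toy model the query itself never names an index; the selection happens entirely inside `select_index`, which mirrors the "no query rewrite" experience the abstract describes.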

In this session watch:
Rahul Potharaju, Principal Engineering Manager, Microsoft
Terry Kim, Principal Software Engineer, Microsoft
Eunjin Song, Senior Software Engineer, Microsoft

Summit Europe 2020 Hyperspace: An Indexing Subsystem for Apache Spark

November 17, 2020 04:00 PM PT

Note: This is a replay of a highly rated session from the June Spark + AI Summit. Enjoy!

At Microsoft, we store datasets (both from internal teams and external customers) ranging from a few GBs to 100s of PBs in our data lake. The scope of analytics on these datasets ranges from traditional batch-style queries (e.g., OLAP) to exploratory, ‘needle in a haystack’ queries (e.g., point lookups, summarization, etc.). Resorting to linear scans of these large datasets with huge clusters for every simple query is prohibitively expensive and not the top choice for many of our customers, who are constantly exploring (and demanding!) ways to reduce their operational costs; incurring unchecked expenses is their worst nightmare. Over the years, we have seen a huge demand for bringing the ‘indexing’ capabilities that come standard in traditional database systems into Apache Spark.

Among the many ways to improve query performance and lower resource consumption in database systems, indexes are particularly effective at accelerating certain workloads, since they can reduce the amount of data scanned for a given query and thus also lower resource costs. In this talk, we present our experiences in designing, implementing and operationalizing Hyperspace, an indexing subsystem for Apache Spark that lets users build, maintain (through a multi-user concurrency model) and leverage indexes (automatically, without any changes to their existing code) on their data (e.g., CSV, JSON, Parquet, etc.) for query/workload acceleration. We will cover the necessary foundations behind our indexing infrastructure, including the API design and how we leveraged Spark’s Catalyst optimizer to provide a transparent user experience, and also discuss our development roadmap. Through presentation, benchmarks, code examples and notebooks, this will be one fun session, so come join us as we get started on this journey. Hyperspace was recently open-sourced at https://github.com/microsoft/hyperspace
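To make the "reduce the amount of data scanned" point concrete, here is a small, self-contained Python sketch that compares a linear scan with a point lookup through a hash-bucketed index. It is a toy in-memory model, not Hyperspace code: the function names, the row layout, and the bucket count are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def full_scan(rows, key, value):
    """Point lookup without an index: every row must be examined."""
    scanned = 0
    matches = []
    for row in rows:
        scanned += 1
        if row[key] == value:
            matches.append(row)
    return matches, scanned


def build_bucket_index(rows, key, num_buckets):
    """Group rows into buckets by the hash of the indexed column (done once, up front)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[key]) % num_buckets].append(row)
    return buckets


def index_lookup(buckets, key, value, num_buckets):
    """Point lookup with the index: only the single matching bucket is examined."""
    bucket = buckets.get(hash(value) % num_buckets, [])
    matches = [row for row in bucket if row[key] == value]
    return matches, len(bucket)


# Usage: 1,000 rows, 10 buckets, look up a single id.
rows = [{"id": i, "payload": i * i} for i in range(1000)]
buckets = build_bucket_index(rows, "id", num_buckets=10)

scan_matches, scan_cost = full_scan(rows, "id", 42)
index_matches, index_cost = index_lookup(buckets, "id", 42, num_buckets=10)
print(scan_cost, index_cost)  # rows examined by each approach
```

Both approaches return the same row, but the indexed lookup examines one bucket instead of the whole dataset. The trade-off is that the index must be built and kept up to date as the underlying data changes.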

Summit 2020 Hyperspace: An Indexing Subsystem for Apache Spark

June 23, 2020 05:00 PM PT

At Microsoft, we store datasets (both from internal teams and external customers) ranging from a few GBs to 100s of PBs in our data lake. The scope of analytics on these datasets ranges from traditional batch-style queries (e.g., OLAP) to exploratory, 'needle in a haystack' queries (e.g., point lookups, summarization, etc.). Resorting to linear scans of these large datasets with huge clusters for every simple query is prohibitively expensive and not the top choice for many of our customers, who are constantly exploring (and demanding!) ways to reduce their operational costs; incurring unchecked expenses is their worst nightmare. Over the years, we have seen a huge demand for bringing the 'indexing' capabilities that come standard in traditional database systems into Apache Spark.

Among the many ways to improve query performance and lower resource consumption in database systems, indexes are particularly effective at accelerating certain workloads, since they can reduce the amount of data scanned for a given query and thus also lower resource costs. In this talk, we present our experiences in designing, implementing and operationalizing Hyperspace, an indexing subsystem for Apache Spark that lets users build, maintain (through a multi-user concurrency model) and leverage indexes (automatically, without any changes to their existing code) on their data (e.g., CSV, JSON, Parquet, etc.) for query/workload acceleration. We will cover the necessary foundations behind our indexing infrastructure, including the API design and how we leveraged Spark's Catalyst optimizer to provide a transparent user experience, and also discuss our development roadmap as we work towards open-sourcing our work for the benefit of the broader community. Through presentation, benchmarks, code examples and notebooks, this will be one fun session, so come join us as we get started on this journey.

Summit Europe 2019 .NET for Apache Spark

October 15, 2019 05:00 PM PT

We present a new, free, open-source framework aimed at making Spark accessible to millions of .NET developers. In this session we will provide a high-level overview of the .NET bindings for Spark effort, demonstrate some key capabilities, show how you can use and get involved with the effort, and also cover how you can use the .NET bindings for Spark with other frameworks such as Databricks' Delta for building E2E real-time analytics solutions. This will be one fun session with demos galore, so come join us as we get started on the .NET bindings for Spark journey!

Summit 2019 Introducing .NET Bindings for Apache Spark

April 24, 2019 05:00 PM PT

We present a new, free, open-source framework aimed at making Spark accessible to millions of .NET developers. In this session we will provide a high-level overview of the .NET bindings for Spark effort, demonstrate some key capabilities, show how you can use and get involved with the effort, and also cover how you can use the .NET bindings for Spark with other .NET frameworks like ML.NET for building E2E real-time analytics solutions. This will be one fun session with demos galore, so come join us as we get started on the .NET bindings for Spark journey!