Uber's Batch Analytics Evolution from Hive to Spark
OVERVIEW
EXPERIENCE | In Person |
---|---|
TYPE | Breakout |
TRACK | Data Warehousing - Analytics and BI |
INDUSTRY | Travel and Hospitality |
TECHNOLOGIES | Apache Spark |
SKILL LEVEL | Advanced |
DURATION | 40 min |
About 40% of Uber's substantial ETL expenses, amounting to millions of dollars, were associated with ETL processes on Hive. At Uber, roughly 30,000 ETL workflows and two million weekly queries used Hive for purposes including ML, compliance and regulatory reporting, finance, and product development. To make batch analytics more efficient, Uber decided to migrate all Hive workloads to SparkSQL.
The migration involved building automation features such as transpilation of Hive queries to SparkSQL, parallel execution on Spark, and a validation framework for data correctness and performance. This session will explore the architecture of Uber's auto-migration framework in depth, covering the challenges encountered during the migration and how they were resolved. It will also share insights into the overall efficiency gains from the migration.
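To make the data-correctness idea concrete, here is a minimal sketch of one way to check that a query produces the same rows on both engines. It is not Uber's validation framework, whose design the abstract does not describe. The function names (`row_digest`, `table_fingerprint`, `outputs_match`) are hypothetical, and the rows are assumed to be already fetched as plain Python tuples.

```python
import hashlib
from collections import Counter
from typing import Iterable, Tuple

Row = Tuple[object, ...]


def row_digest(row: Row) -> str:
    """Hash a single row; repr() keeps None, 1 and '1' distinct."""
    return hashlib.sha256(repr(row).encode("utf-8")).hexdigest()


def table_fingerprint(rows: Iterable[Row]) -> Counter:
    """Order-independent multiset of row digests (duplicate rows are counted)."""
    return Counter(row_digest(r) for r in rows)


def outputs_match(hive_rows: Iterable[Row], spark_rows: Iterable[Row]) -> bool:
    """True when both result sets hold the same rows, ignoring row order."""
    return table_fingerprint(hive_rows) == table_fingerprint(spark_rows)


# Hypothetical results of the same query run on each engine.
hive_result = [(1, "a"), (2, "b")]
spark_result = [(2, "b"), (1, "a")]
print(outputs_match(hive_result, spark_result))
```

Comparing multisets rather than ordered lists matters because neither engine guarantees row order without an explicit ORDER BY. At production scale a check like this would more plausibly run as an aggregate inside the engine than on rows collected to a single machine.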
SESSION SPEAKERS
Kumudini Kakwani
Senior Software Engineer
Uber
Akshayaprakash Sharma
Senior Software Engineer
Uber