Hudi: Large-Scale, Near Real-Time Pipelines at Uber - Databricks



Uber needs to provide faster, fresher data to data consumers and products that run hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture, and use cases of the second generation of 'Hudi', a self-contained Apache Spark library for building large-scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) was created to effectively manage petabytes of analytical data on distributed storage while supporting fast ingestion and queries. In this talk, we will discuss how we leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing tradeoffs and operational experience. We will also show how to ingest data into Hudi using the Spark Datasource/Streaming APIs and build notebooks and dashboards on top using Spark SQL. For more details, see https://github.com/uber/hudi
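As a flavor of the Datasource path mentioned above, here is a minimal sketch of assembling the option map for a Hudi upsert. The option keys follow the Apache Hudi documentation; the table and field names (`trips`, `uuid`, `ts`, `city`) are hypothetical, and the actual Spark write call is shown only in a comment since it requires a live SparkSession with the Hudi bundle on the classpath.

```python
def hudi_write_options(table_name, record_key, precombine_key, partition_field):
    """Build Spark Datasource options for an idempotent Hudi upsert.

    The record key identifies a row, the precombine field picks the latest
    version when duplicates arrive, and the partition-path field lays the
    dataset out on distributed storage.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }

# Hypothetical trips dataset keyed by uuid, versioned by event timestamp:
opts = hudi_write_options("trips", "uuid", "ts", "city")

# With a SparkSession and the Hudi library available, the write would look
# roughly like:
#   df.write.format("hudi").options(**opts).mode("append").save("/data/trips")
```

Because upsert resolves records by key, re-running the same ingestion is safe, which is what enables the fast, repeated ingestion the abstract describes.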

Session hashtag: #SAISEco10

About Nishith Agarwal

Nishith is an early engineer of the data team at Uber, as well as an initial committer of "Hudi." He has a keen interest in unified architectures for data analytics and processing. In the past, he has built distributed data systems that leverage both batch and stream processing.

About Vinoth Chandar

Vinoth is the founding engineer/architect of the data team at Uber, as well as the author of many data processing and querying systems at Uber, including "Hoodie". He has a keen interest in unified architectures for data analytics and processing. Previously, Vinoth was the lead on LinkedIn's Voldemort key-value store and has also worked on the Oracle Database replication engine, HPC, and stream processing.