Nishith is an early engineer on the data team at Uber and an initial committer of Hudi. He has a keen interest in unified architectures for data analytics and processing. In the past, he has built distributed data systems that leverage both batch and stream processing.
Uber has a real need to provide faster, fresher data to data consumers and products, running hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture, and use cases of the second generation of Hudi, a self-contained Apache Spark library for building large-scale analytical datasets, designed to serve such needs and beyond. Hudi (formerly Hoodie) was created to effectively manage petabytes of analytical data on distributed storage while supporting fast ingestion and queries. In this talk, we will discuss how we leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing tradeoffs and operational experience. We will also show how to ingest data into Hudi using the Spark Datasource/Streaming APIs and how to build notebooks and dashboards on top using Spark SQL.
For more details, see https://github.com/uber/hudi
Session hashtag: #SAISEco10