Vinoth is the founding engineer/architect of the data team at Uber, as well as the author of many data processing & querying systems at Uber, including “Hoodie”. He has a keen interest in unified architectures for data analytics and processing. Previously, Vinoth was the lead on LinkedIn’s Voldemort key-value store and has also worked on the Oracle Database replication engine, HPC, and stream processing.
Uber has a real need to provide faster, fresher data to data consumers & products, running hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture & use cases of the second generation of 'Hudi', a self-contained Apache Spark library for building large-scale analytical datasets, designed to serve such needs and beyond. Hudi (formerly Hoodie) was created to effectively manage petabytes of analytical data on distributed storage, while supporting fast ingestion & queries. In this talk, we will discuss how we leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing tradeoffs & operational experience. We will also show how to ingest data into Hudi using the Spark Datasource/Streaming APIs and build Notebooks/Dashboards on top using Spark SQL. Head over here for more details: https://github.com/uber/hudi Session hashtag: #SAISEco10
Apache Sqoop has been used primarily for transferring data between relational databases and HDFS, leveraging the Hadoop MapReduce engine. Recently the Sqoop community has made changes to allow data transfer between any two data sources represented in code by Sqoop connectors. For instance, it’s possible to use the latest Apache Sqoop to transfer data from MySQL to Kafka or vice versa via the JDBC connector and Kafka connector, respectively. This talk will focus on running Sqoop jobs on the Apache Spark engine and on proposed extensions to the APIs to use Spark functionality. We’ll discuss the design options explored and implemented to submit jobs to the Spark engine. We’ll demo one of the Sqoop job flows on Apache Spark and show how to use the Sqoop job APIs to monitor Sqoop jobs. The talk will conclude with use cases for Sqoop and Spark at Uber.
Prasanna Rajaperumal and Vinoth Chandar will explore the specific problem of ingesting petabytes of data at Uber and why they ended up building an analytical datastore from scratch using Spark. Prasanna will discuss design choices and implementation approaches in building Hoodie to provide near-real-time data ingestion and querying using Spark and HDFS. Session hashtag: #SFexp4