Vinoth is the founding engineer/architect of the data team at Uber, as well as author of many data processing & querying systems at Uber, including “Hoodie”. He has keen interest in unified architectures for data analytics and processing. Previously, Vinoth was the lead on Linkedin’s Voldemort key value store and has also worked on Oracle Database replication engine, HPC, and stream processing.
Apache Sqoop has been used primarily for transfer of data between relational databases and HDFS, leveraging the Hadoop Mapreduce engine. Recently the Sqoop community has made changes to allow data transfer across any two data sources represented in code by Sqoop connectors. For instance, it’s possible to use the latest Apache Sqoop to transfer data from MySQL to kafka or vice versa via the jdbc connector and kafka connector, respectively. This talk will focus on running Sqoop jobs on Apache Spark engine and proposed extensions to the APIs to use the Spark functionality. We’ll discuss the design options explored and implemented to submit jobs to the Spark engine. We’ll do a demo of one of the Sqoop job flows on Apache spark and how to use the Sqoop job APIs to monitor the Sqoop jobs. The talk will conclude use cases for Sqoop and Spark at Uber.Learn more:
Prasanna Rajaperumal and Vinoth Chandar will explore a specific problem of ingesting petabytes of data in Uber and why they ended up building an analytical datastore from scratch using Spark. Prasanna will discuss design choices and implementation approaches in building Hoodie to provide near-real-time data ingestion and querying using Spark and HDFS. Session hashtag: #SFexp4