In this presentation, Vineet will walk through a case study of one of his customers using Spark to migrate terabytes of data from GPFS into Hive tables. The ETL pipeline was built entirely in Spark. The pipeline extracted target (Hive) table properties such as the identification of Hive Date/Timestamp columns, whether the target table is partitioned or non-partitioned, the target storage format (Parquet or Avro), and the source-to-target column mappings. These target tables contain anywhere from a few to hundreds of columns, and the pipeline converted non-standard date formats into Hive's standard timestamp format.
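The date-normalization step the abstract mentions can be sketched briefly. The snippet below is a minimal illustration, not the speaker's actual pipeline code: the candidate format list and the function name are hypothetical, and in a real Spark job this logic would typically be expressed with `to_timestamp` over candidate patterns rather than row-by-row Python. It shows the core idea of coercing non-standard date strings into Hive's standard `yyyy-MM-dd HH:mm:ss` timestamp format.

```python
from datetime import datetime
from typing import Optional

# Hypothetical list of non-standard source formats; the real pipeline
# presumably derived these from table metadata.
CANDIDATE_FORMATS = ["%d/%m/%Y", "%m-%d-%Y %H:%M", "%Y%m%d"]

# Hive's standard timestamp layout: yyyy-MM-dd HH:mm:ss
HIVE_TS_FORMAT = "%Y-%m-%d %H:%M:%S"

def to_hive_timestamp(value: str) -> Optional[str]:
    """Try each known source format; return a Hive-standard timestamp string,
    or None if the value matches no known format (left for data-quality handling)."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime(HIVE_TS_FORMAT)
        except ValueError:
            continue
    return None

print(to_hive_timestamp("31/12/2019"))  # 2019-12-31 00:00:00
```

In Spark SQL the equivalent column-level expression would be a `coalesce` of `to_timestamp(col, fmt)` calls, one per candidate format, which keeps the conversion inside the query engine instead of a Python UDF.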
Vineet Kumar works in the banking industry and has 16 years of IT experience, during which he has played many key roles, including developer and data architect. For the last four years he has focused on Hadoop and Spark. In his free time, he writes technical blogs. Vineet is passionate about open-source products. His goal is to provide business data to business users in real time for analytics.