Apache Spark Based Reliable Data Ingestion in Datalake – Databricks



Ingesting data from a variety of sources — MySQL, Oracle, Kafka, Salesforce, BigQuery, S3, SaaS applications, OSS, etc. — with billions of records into a data lake (for reporting, ad-hoc analytics, and ML jobs) with reliability, consistency, schema evolution support, and within the expected SLA has always been a challenging job. Ingestion also comes in different flavors — full ingestion, and incremental ingestion with or without compaction/de-duplication and transformations — each with its own complexity of state management and performance. Not to mention dependency management: hundreds or thousands of downstream jobs depend on this ingested data, so on-time data availability is of utmost importance. Most data teams end up creating ad-hoc ingestion pipelines written in different languages and technologies, which adds operational overhead, and the knowledge is mostly limited to a few people.

In this session, I will talk about how we leveraged Spark's DataFrame abstraction to create a generic ingestion platform capable of ingesting data from varied sources with reliability, consistency, automatic schema evolution, and transformation support. I will also discuss how we developed Spark-based data sanity checks as one of the core components of this platform, to ensure 100% correctness of ingested data and auto-recovery when inconsistencies are found. This talk will also cover how Hive table creation and schema modification were part of this platform, providing read-time consistency without locking while Spark ingestion jobs were writing to the same Hive tables, and how we maintained different versions of ingested data so that we could roll back if required and so that consumers of the data could go back in time and read a snapshot of the data as of that moment.
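The versioning and rollback idea described above can be sketched in plain Python. This is a hypothetical illustration only, not the platform's actual implementation: it assumes each ingestion run writes an immutable versioned directory, and readers resolve the current snapshot through a small pointer file that is swapped atomically, so readers never see a half-written version and rollback is just re-pointing. All function and file names here are invented for the sketch.

```python
import json
import os
import tempfile

# Hypothetical sketch: each ingestion run writes a new immutable
# versioned directory; a pointer file tells readers which version is
# "current". Swapping the pointer with an atomic rename gives readers
# a consistent snapshot, and rollback is just moving the pointer back.

def publish_version(table_root, version, files):
    """Write a new immutable snapshot directory, e.g. <root>/v2/."""
    vdir = os.path.join(table_root, f"v{version}")
    os.makedirs(vdir)
    for name, payload in files.items():
        with open(os.path.join(vdir, name), "w") as f:
            f.write(payload)
    return vdir

def set_current(table_root, version):
    """Atomically point readers at a snapshot (temp file + rename)."""
    pointer = os.path.join(table_root, "_CURRENT")
    tmp = pointer + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"version": version}, f)
    os.replace(tmp, pointer)  # atomic replace on POSIX and Windows

def read_current(table_root):
    """Resolve the snapshot directory a reader should use right now."""
    with open(os.path.join(table_root, "_CURRENT")) as f:
        version = json.load(f)["version"]
    return os.path.join(table_root, f"v{version}")

root = tempfile.mkdtemp()
publish_version(root, 1, {"part-0.csv": "id,amount\n1,10\n"})
set_current(root, 1)
publish_version(root, 2, {"part-0.csv": "id,amount\n1,10\n2,20\n"})
set_current(root, 2)

assert read_current(root).endswith("v2")
set_current(root, 1)  # rollback: re-point readers at the old snapshot
assert read_current(root).endswith("v1")
```

Because old version directories are never mutated, a reader that resolved `v1` before the pointer swap can keep reading it consistently while writers publish `v2` — the same property the talk describes for concurrent Hive reads during ingestion.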

After this talk, you should understand the challenges involved in ingesting data reliably from different sources and how Spark's DataFrame abstraction can be leveraged to solve them in a unified way.
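To make the "unified way" concrete, here is a minimal, hypothetical sketch of the pattern: every source type maps to a reader that produces the same tabular shape (a list of dicts stands in for a Spark DataFrame here, so the example runs without a Spark cluster), and all downstream steps — de-duplication, sanity checks, writes — stay source-agnostic. The reader names, configs, and dedup key are assumptions for illustration, not the platform's actual API.

```python
# Hypothetical sketch of a unified ingestion entry point. In a real
# Spark pipeline each reader would return a DataFrame, e.g.
#   spark.read.format("jdbc").options(**conf).load()
#   spark.read.parquet(conf["path"])
# Here plain lists of dicts stand in, so the sketch is self-contained.

def read_jdbc(conf):
    # Stand-in for a JDBC source (MySQL, Oracle, ...).
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": 20},
            {"id": 2, "amount": 20}]  # duplicate row from a retry

def read_s3(conf):
    # Stand-in for a file-based source on S3.
    return [{"id": 3, "amount": 30}]

READERS = {"jdbc": read_jdbc, "s3": read_s3}

def ingest(source_conf):
    """One entry point for every source: pick a reader by type, then
    apply the same source-agnostic post-processing (here, dedup on a
    primary key, mirroring DataFrame-level de-duplication)."""
    rows = READERS[source_conf["type"]](source_conf)
    seen, deduped = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            deduped.append(row)
    return deduped

assert len(ingest({"type": "jdbc"})) == 2  # duplicate dropped
assert len(ingest({"type": "s3"})) == 1
```

The point of the pattern is that adding a new source only means registering one more reader; everything after the read is shared, which is what keeps the platform generic instead of one ad-hoc pipeline per source.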

Session hashtag: #SAISDev13

About Gagan Agrawal

Gagan has over 12 years of industry experience and is currently working with Paytm (India's largest payment platform) as a Data Architect. He likes to create data-centric generic products and has over 5 years of Big Data experience, having worked on products and frameworks around ingestion, stream processing, batch computation, ad-hoc analytics, etc. In his previous work, he created a custom SQL-like expression language and a stream processing engine based on the concept of a Profile Tree. His latest interest is in solving the data ingestion problem from varied sources with reliability and consistency.