Tagging and Processing Data in Real-Time Using Spark Streaming

Apache Spark is a flexible, scalable, and fault-tolerant data processing framework that specializes in processing large amounts of data. Spark Streaming builds on the core library to consume data in real time from ingest systems such as Apache Kafka, Apache Flume, and Amazon Kinesis. In this talk, we will cover recent advances in Spark Streaming: the design of several new features that have improved performance and eliminated the possibility of data loss. We will discuss how Salesforce.com uses Spark Streaming to normalize data arriving from a variety of sources in real time, and how this normalized data is then tagged and made available to downstream applications for consumption. We will also discuss the integration of Spark Streaming with Kafka in both directions, and why that integration is important for this use case.

About Hari Shreedharan

Hari Shreedharan is a Software Engineer at StreamSets, where he builds products to make data ingest easy. Previously, he was a Software Engineer at Cloudera, where he worked on Apache Spark, Apache Flume, and Apache Sqoop. He is also the PMC chair of the Apache Flume project.