Hari Shreedharan is a Software Engineer at StreamSets, where he builds products to make data ingest easy. Previously, he was a Software Engineer at Cloudera, where he worked on Apache Spark, Apache Flume, and Apache Sqoop. He is also the PMC chair of the Apache Flume project.
Apache Spark is a flexible, scalable, and fault-tolerant data processing framework that specializes in processing large amounts of data. Spark Streaming builds on top of the core library to consume data in real time from ingest systems such as Apache Kafka, Apache Flume, and Amazon Kinesis. This talk covers the recent advances in Spark Streaming: the design of several new features that have improved performance and eliminated the possibility of data loss. We will discuss how Salesforce.com uses Spark Streaming to normalize data arriving from a variety of sources in real time, and how this normalized data is then tagged and made available to downstream applications for consumption. We will also discuss the integration of Spark Streaming with Kafka in both directions, reading from and writing to Kafka, and why that integration is important for this use case.
StreamSets Data Collector (SDC) is designed to make data ingest and processing easy. SDC integrates with Apache Spark at several levels to simplify data analysis using Spark, and it works with Databricks Cloud to trigger jobs based on incoming data. In this talk, you will learn how a large retailer with thousands of outlets is using StreamSets to power Spark jobs on Databricks Cloud, combining real-time foot traffic data with historical behavioral and transaction data for analytic insights that improve revenue per square foot.