Building Data Pipelines with Spark and StreamSets - Databricks

Big data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Metadata in upstream sources can ‘drift’ due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is an open source platform, licensed under Apache 2.0, for building big data ingest pipelines; it lets you design, execute and monitor robust data flows. In this session we’ll look at how SDC’s “intent-driven” approach keeps the data flowing, with a particular focus on clustered deployment with Spark and on other Spark integrations in the works.

About Pat Patterson

Pat Patterson has been working with Internet technologies since 1997, building software and working with communities at Sun Microsystems, Huawei, Salesforce and StreamSets. At Sun, Pat was the community lead for the OpenSSO open source project, while at Huawei he developed cloud storage infrastructure software. Part of the developer evangelism team at Salesforce, Pat focused on identity, integration and the Internet of Things. Now community champion at StreamSets, Pat is responsible for the care and feeding of the StreamSets open source community.