Near-real-time analytics has become a common requirement for many data teams as the technology has caught up to the demand. One of the hardest parts of enabling it is ingesting and deduplicating source data often enough to be useful to analysts, while writing that data in a format your analytics query engine can consume. This is usually the domain of several tools, because the problem has three distinct aspects: streaming ingestion of data, deduplication via an ETL process, and interactive analytics. With Spark, all three can be handled by one tool.
This talk will walk you through how to use Spark Streaming to ingest change-log data, use Spark batch jobs to perform major and minor compaction, and query the results with Spark SQL. By the end of this talk you will know what is required to set up near-real-time analytics at your organization, the common gotchas (including file formats and distributed file systems), and how to handle the unique data-integrity issues that arise from near-real-time analytics.
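To make the compaction step concrete: deduplicating change-log data reduces to keeping the latest record per primary key and dropping keys whose final operation is a delete. The sketch below shows that logic in plain Python; the talk implements it with Spark batch jobs, and the record fields used here (`id`, `seq`, `op`) are illustrative assumptions, not the speakers' actual schema.

```python
# Minimal plain-Python sketch of change-log compaction.
# Assumption: each record carries a primary key ("id"), a monotonically
# increasing sequence number ("seq"), and an operation type ("op").

def compact(changelog):
    """Keep only the latest change per primary key; drop deleted rows."""
    latest = {}
    for record in changelog:
        key = record["id"]
        # A record with a higher sequence number supersedes earlier ones.
        if key not in latest or record["seq"] > latest[key]["seq"]:
            latest[key] = record
    # Keys whose final state is a delete are removed entirely.
    return [r for r in latest.values() if r["op"] != "delete"]

changelog = [
    {"id": 1, "seq": 1, "op": "insert", "name": "a"},
    {"id": 1, "seq": 2, "op": "update", "name": "b"},
    {"id": 2, "seq": 1, "op": "insert", "name": "c"},
    {"id": 2, "seq": 3, "op": "delete", "name": None},
]
print(compact(changelog))
# → [{'id': 1, 'seq': 2, 'op': 'update', 'name': 'b'}]
```

In a Spark job the same idea is typically expressed as a window or group-by over the primary key ordered by sequence number, taking the top row per key.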
Brandon is a principal data engineer at Eventbrite. He began using Spark in 2014 to help law enforcement find and recover victims of human trafficking. Lately he's been dedicated to building Eventbrite's data infrastructure around Apache Spark and related tools.
Beck is a data engineer focused on building scalable data pipelines and infrastructure with Apache Spark. With over four years of engineering experience at Eventbrite, she works to make data more accessible and reliable for analysts, data scientists, engineers, and decision makers. She is responsible for deploying machine learning models and exporting their predictions to the product, and delivers performant, optimized solutions to a wide variety of data problems at Eventbrite.