Brandon is a principal data engineer at Eventbrite. He began using Spark in 2014 to help law enforcement find and recover victims of human trafficking. Lately he’s been dedicated to building Eventbrite’s data infrastructure around Apache Spark and related tools.
April 24, 2019 05:00 PM PT
Near real-time analytics has become a common requirement for many data teams as the technology has caught up to the demand. One of the hardest aspects of enabling near real-time analytics is making sure the source data is ingested and deduplicated often enough to be useful to analysts, while writing the data in a format that is usable by your analytics query engine. This is usually the domain of many tools, since there are three different aspects of the problem: streaming ingestion of data, deduplication using an ETL process, and interactive analytics. With Spark, this can be done with one tool.
This talk will walk you through how to use Spark Streaming to ingest change-log data, use Spark batch jobs to perform major and minor compaction, and query the results with Spark SQL. At the end of this talk you will know what is required to set up near real-time analytics at your organization, the common gotchas including file formats and distributed file systems, and how to handle the unique data integrity issues that arise from near real-time analytics.
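As a rough illustration of the compaction step mentioned above, the sketch below applies a change log to an existing snapshot and keeps only the latest state per key. It is plain Python rather than Spark, and it is not taken from the talk: the record layout (`key`, `sequence number`, `operation`, `row`), the `"upsert"`/`"delete"` operation names, and the `compact` function are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

# Hypothetical change-log record: (primary_key, sequence_number, operation, row),
# where operation is "upsert" or "delete". These names are illustrative only.
Change = Tuple[str, int, str, Optional[Dict[str, Any]]]


def compact(snapshot: Dict[str, Any], changes: Iterable[Change]) -> Dict[str, Any]:
    """Apply a change log to a snapshot, keeping only the latest state per key."""
    # Keep the change with the highest sequence number for each key.
    # Duplicate deliveries (same key and sequence number) collapse to one entry.
    latest: Dict[str, Change] = {}
    for change in changes:
        key, seq, _, _ = change
        if key not in latest or seq > latest[key][1]:
            latest[key] = change

    result = dict(snapshot)  # copy so the input snapshot is left untouched
    for key, (_, _, op, row) in latest.items():
        if op == "delete":
            result.pop(key, None)
        else:
            result[key] = row
    return result


snapshot = {"a": {"status": "draft"}, "b": {"status": "live"}}
changes = [
    ("a", 1, "upsert", {"status": "live"}),
    ("a", 2, "upsert", {"status": "ended"}),
    ("a", 2, "upsert", {"status": "ended"}),  # duplicate delivery
    ("b", 3, "delete", None),
    ("c", 4, "upsert", {"status": "draft"}),
]
compacted = compact(snapshot, changes)
```

In a Spark job the same idea would typically be expressed as a grouped or windowed operation over the change-log data; the dictionary version here only shows the logic on a handful of records.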
October 22, 2021 02:29 PM PT
Deploying machine learning models seems like it should be a relatively easy task: take your model and pass it some features in production. The reality is that the code written during the prototyping phase of model development doesn't always work when applied at scale or on "real" data. This talk will explore 1) common problems at the intersection of data science and data engineering, 2) how you can structure your code so there is minimal friction between prototyping and production, and 3) how you can use Apache Spark to run predictions on your models in batch or streaming contexts.
You will take away how to address some of the productionizing issues that data scientists and data engineers face while deploying machine learning models at scale, and a better understanding of how to work collaboratively to minimize disparity between prototyping and productizing.
Session hashtag: #SAISDS2