Scott Haines - Databricks

Scott Haines

Full Stack Engineer, Twilio

Scott Haines is a full stack engineer with a current focus on real-time, highly available, trustworthy analytics systems. He is currently working at Twilio (as Tech Lead of the Voice Insights team), where he helped drive Spark adoption and streaming pipeline architectures. Prior to Twilio, he wrote the backend Java APIs for Yahoo Games, as well as the real-time game ranking/ratings engine (built on Storm) that provided personalized recommendations and page views for 10 million customers. He finished his tenure at Yahoo working for Flurry Analytics, where he wrote the alerts/notifications system for mobile.

UPCOMING SESSIONS

Building a Streaming Microservice Architecture: with Apache Spark Structured Streaming and Friends (Summit 2020)

As we continue to push the boundaries of what is possible with respect to pipeline throughput and data serving tiers, new methodologies and techniques continue to emerge to handle larger and larger workloads - from real-time processing and aggregation of user/behavioral data, to rule-based/conditional distribution of event and metric streams, to almost any data pipeline/lineage problem. These workloads are typical in most modern data platforms and are critical to all operational analytics systems, data storage systems, ML/DL and beyond. One of the common problems I've seen across a lot of companies can be reduced to general data reliability problems, mainly due to scaling and migrating processing components as a company expands and teams grow. What was a few systems can quickly fan out into a slew of independent components and serving layers, all of which need to be scaled up, down, or out with zero downtime to meet the demands of a world hungry for data. During this technical deep dive, a new mental model will be built up which aims to reinvent how one should build massive, interconnected services using Kafka, Google Protocol Buffers/gRPC, and Parquet/Delta Lake/Spark Structured Streaming. The material presented during the deep dive is based on lessons learned the hard way while building up a massive real-time insights platform at Twilio, where data integrity and stream fault tolerance are as critical as the services our company provides.
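One way to picture the "rule-based / conditional distribution of event and metric streams" idea from this abstract, stripped of Kafka and gRPC, is a versioned event envelope plus a predicate-based router. This is a minimal Python sketch under assumed names (`EventEnvelope`, `StreamRouter`, and the example routing rules are hypothetical illustrations, not code from the talk):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical envelope: a stable, versioned wrapper around a payload,
# standing in for a Protocol Buffers message flowing over a Kafka topic.
@dataclass
class EventEnvelope:
    event_type: str      # e.g. "call.completed"
    schema_version: int  # lets producers and consumers evolve independently
    payload: dict

# Rule-based router: each predicate decides whether an envelope is
# copied onto a downstream "topic" (modeled here as a named list).
class StreamRouter:
    def __init__(self) -> None:
        self.routes: List[Tuple[str, Callable[[EventEnvelope], bool]]] = []
        self.topics: Dict[str, List[EventEnvelope]] = {}

    def add_route(self, topic: str,
                  predicate: Callable[[EventEnvelope], bool]) -> None:
        self.routes.append((topic, predicate))
        self.topics.setdefault(topic, [])

    def publish(self, event: EventEnvelope) -> None:
        # An event may match several routes; each match gets a copy.
        for topic, predicate in self.routes:
            if predicate(event):
                self.topics[topic].append(event)

router = StreamRouter()
router.add_route("voice-metrics", lambda e: e.event_type.startswith("call."))
router.add_route("alerts", lambda e: e.payload.get("error_rate", 0.0) > 0.05)

router.publish(EventEnvelope("call.completed", 1, {"duration_ms": 1200}))
router.publish(EventEnvelope("call.failed", 1, {"error_rate": 0.12}))

print(len(router.topics["voice-metrics"]))  # 2 - both events are call.* events
print(len(router.topics["alerts"]))         # 1 - only the failed call alerts
```

In a real deployment the predicates would run inside streaming jobs and the topics would be Kafka topics with schema-checked Protocol Buffers payloads; the point of the sketch is only the shape of the fan-out.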

PAST SESSIONS

Streaming Trend Discovery: Real-Time Discovery in a Sea of Events – continues (Summit 2018)

Time is the one thing we can never get in front of. It is rooted in everything, and "timeliness" is now more important than ever, especially as we see businesses automate more and more of their processes. This presentation will scratch the surface of streaming discovery with a deeper dive into the telecommunications space, where it is normal to receive billions of events a day from globally distributed sub-systems and where key decisions must be automated. We'll start out with a quick primer on telecommunications, an overview of the key components of our architecture, and a case for the importance of "ringing". We will then walk through a simplified solution for doing windowed histogram analysis and labeling of data in flight using Spark Structured Streaming and mapGroupsWithState. I will walk through some suggestions for scaling up to billions of events, managing memory when using the Spark StateStore, and how to avoid pitfalls with the serialized data stored there.

What you will learn:
1. How to use the new features of Spark 2.2.0 (mapGroupsWithState / StateStore)
2. How to bucket and analyze data in the streaming world
3. How to avoid common serialization mistakes (e.g. how to upgrade application code and retain stored state)
4. More about the telecommunications space than you'll probably want to know!
5. A new approach to building applications for enterprise and production

Assumptions:
1. You know Scala, or want to know more about it
2. You have deployed Spark to production at your company, or want to
3. You want to learn some neat tricks that may save you tons of time!

Takeaways:
1. A fully functioning Spark app, with unit tests!
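The windowed histogram analysis described above can be sketched outside of Spark: `mapGroupsWithState` keeps an arbitrary state value per group key across micro-batches, and for this use case that state is a set of event-time windows, each holding a histogram of bucketed durations. A minimal Python stand-in for that per-key state fold (the window and bucket widths, key names, and `update_state` helper are all hypothetical, not the talk's actual code):

```python
from collections import defaultdict

# Illustrative constants, not values from the talk.
WINDOW_MS = 60_000   # 1-minute event-time windows
BUCKET_MS = 500      # histogram bucket width for ring durations

def bucketize(value_ms: int) -> int:
    """Snap a duration down to the start of its histogram bucket."""
    return (value_ms // BUCKET_MS) * BUCKET_MS

def update_state(state: dict, key: str,
                 event_time_ms: int, ring_ms: int) -> dict:
    """Fold one event into per-key state: window_start -> histogram.

    This mimics what the update function passed to mapGroupsWithState
    would do with its GroupState for one group key.
    """
    window_start = (event_time_ms // WINDOW_MS) * WINDOW_MS
    histograms = state.setdefault(key, {})
    hist = histograms.setdefault(window_start, defaultdict(int))
    hist[bucketize(ring_ms)] += 1
    return state

state: dict = {}
events = [
    ("carrier-a", 10_000, 1_250),
    ("carrier-a", 20_000, 1_400),
    ("carrier-a", 70_000, 300),   # lands in the next 1-minute window
]
for key, ts, ring in events:
    update_state(state, key, ts, ring)

# Window [0, 60s): 1,250ms and 1,400ms both fall in the 1,000ms bucket.
print(state["carrier-a"][0][1000])    # 2
print(state["carrier-a"][60_000][0])  # 1
```

In the real Spark version this state lives in the StateStore, which is exactly why the serialization pitfalls the abstract mentions matter: the state class must stay deserializable across application upgrades, or the stored histograms are lost.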

Streaming Trend Discovery: Real-Time Discovery in a Sea of Events (Summit 2018)

Session hashtag: #DDSAIS16