Scott Haines

Full Stack Engineer, Twilio

Scott Haines is a full stack engineer with a current focus on real-time analytics and intelligence systems. He works at Twilio as a Senior Principal Software Engineer on the Voice Insights team, where he helped drive Spark adoption and streaming pipeline architectures, and helped architect and build out a massive stream and batch processing platform.

In addition to his role on the Voice Insights team, he is one of the software architects for the machine learning platform at Twilio, where he is helping to shape the future of experimentation, model training, and secure integration.

Scott currently runs the company-wide Spark Office Hours, where he provides guidance, tutorials, workshops, and hands-on training to engineers and teams across Twilio.
Prior to Twilio, he wrote the backend Java APIs for Yahoo Games, as well as the real-time game ranking/ratings engine (built on Storm) that provided personalized recommendations and page views for 10 million customers. He finished his tenure at Yahoo with Flurry Analytics, where he wrote the mobile alerts/notifications system.

Past sessions

Apache Spark™ 3.0 Deep Dives Meetup

November 17, 2020 04:00 PM PT

Jules Damji and Denny Lee from Databricks Developer Relations will recap some keynote highlights, and each will briefly present personal picks from sessions that resonated well with them. Next, Jacek Laskowski, an independent consultant, will speak about Spark 3.0 internals, and Scott Haines from Twilio, Inc. will give a talk about structured streaming microservice architectures. This live coding session and technical deep dive are not to be missed!

Jacek Laskowski Slides

Scott Haines Slides

As we continue to push the boundaries of what is possible with respect to pipeline throughput and data serving tiers, new methodologies and techniques continue to emerge to handle larger and larger workloads: from real-time processing and aggregation of user/behavioral data, to rule-based/conditional distribution of event and metric streams, to almost any data pipeline/lineage problem. These workloads are typical in most modern data platforms and are critical to all operational analytics systems, data storage systems, ML/DL, and beyond. One of the common problems I've seen across many companies reduces to general data reliability, mainly due to scaling and migrating processing components as a company expands and teams grow. What was once a few systems can quickly fan out into a slew of independent components and serving layers, all of which need to be scaled up, down, or out with zero downtime to meet the demands of a world hungry for data. During this technical deep dive, a new mental model will be built up that aims to reinvent how one should build massive, interconnected services using Kafka, Google Protocol Buffers/gRPC, and Parquet/Delta Lake/Spark Structured Streaming. The material presented during the deep dive is based on lessons learned the hard way while building up a massive real-time insights platform at Twilio, where data integrity and stream fault tolerance are as critical as the services our company provides.
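One way to picture the "interconnected services" idea above is a versioned event envelope with rule-based routing: producers evolve their payload schemas independently while consumers route on a stable outer shape. This is only a minimal plain-Scala sketch under assumed names (`EventEnvelope`, `EventRouter`, and the topic names are illustrative, not Twilio's actual API):

```scala
// Hypothetical sketch: every event travels inside a versioned envelope, so
// payload schemas (e.g. Protocol Buffers messages) can evolve without
// breaking downstream consumers. All names here are illustrative.
final case class EventEnvelope(
  eventType: String,    // routing key, e.g. "call.completed"
  schemaVersion: Int,   // bumped whenever the payload schema changes
  payload: Array[Byte]  // serialized body, opaque to the router
)

object EventRouter {
  // Rule-based / conditional distribution: decide which downstream
  // streams should receive an event based only on the envelope.
  def route(e: EventEnvelope): Seq[String] = e.eventType match {
    case t if t.startsWith("call.")   => Seq("voice-insights", "metrics")
    case t if t.startsWith("metric.") => Seq("metrics")
    case _                            => Seq("dead-letter")
  }
}
```

Because the router never inspects `payload`, serving layers can be scaled or migrated independently of the teams producing the events.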

Time is the one thing we can never get in front of. It is rooted in everything, and "timeliness" is now more important than ever, especially as we see businesses automate more and more of their processes. This presentation will scratch the surface of streaming discovery with a deeper dive into the telecommunications space, where it is normal to receive billions of events a day from globally distributed sub-systems and where key decisions must be automated.

We'll start out with a quick primer on telecommunications, an overview of the key components of our architecture, and a case for the importance of "ringing". We will then walk through a simplified solution for windowed histogram analysis and labeling of data in flight using Spark Structured Streaming and mapGroupsWithState. I will walk through some suggestions for scaling up to billions of events, managing memory when using the Spark StateStore, and avoiding pitfalls with the serialized data stored there.
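The windowed histogram-and-labeling step described above can be sketched in plain Scala, outside of Spark, so the bucketing logic is easy to test on its own. The bucket boundaries and label names below are illustrative assumptions, not the actual thresholds used in the talk:

```scala
object RingHistogram {
  // Illustrative "time to ring" buckets in milliseconds, ordered ascending.
  val buckets: Seq[(String, Long)] = Seq(
    "fast"   -> 1000L,  // rang within 1 second
    "normal" -> 5000L,  // within 5 seconds
    "slow"   -> 15000L  // within 15 seconds; anything above is "failing"
  )

  // Label a single measurement with the first bucket it fits into.
  def label(ringMillis: Long): String =
    buckets.collectFirst { case (name, upper) if ringMillis <= upper => name }
      .getOrElse("failing")

  // Collapse one window of measurements into a labeled histogram —
  // analogous to what a per-key state update function would compute.
  def histogram(window: Seq[Long]): Map[String, Int] =
    window.groupBy(label).map { case (k, v) => k -> v.size }
}
```

Inside a real streaming job, a function like `histogram` would run per key inside the `mapGroupsWithState` update function, with the running counts kept in state between micro-batches.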

What you will learn:
1. How to use the new features of Spark 2.2.0 (mapGroupsWithState / StateStore)
2. How to bucket and analyze data in the streaming world
3. How to avoid common serialization mistakes (e.g., how to upgrade application code and retain stored state)
4. More about the telecommunications space than you'll probably want to know!
5. A new approach to building applications for enterprise and production.
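Point 3 above (upgrading application code while retaining stored state) can be illustrated with a common pattern: give newly added state fields default values so that state serialized by an older release still deserializes after a deploy. This is a hedged sketch with a tiny hand-rolled codec standing in for a real serializer; all names are hypothetical:

```scala
// Illustrative upgrade-safe state: fields added in later releases carry
// defaults, so old serialized state remains readable after a deploy.
// With Spark's StateStore the same idea applies to whatever encoder
// serializes the per-key state.
final case class CallState(
  callSid: String,
  ringCount: Int,
  lastLabel: String = "unknown" // added in a later release
)

object StateCodec {
  // Toy delimited codec standing in for a real serializer (e.g. protobuf),
  // where missing trailing fields fall back to the case-class defaults.
  def encode(s: CallState): String = s"${s.callSid}|${s.ringCount}|${s.lastLabel}"

  def decode(raw: String): CallState = raw.split('|') match {
    case Array(sid, n, lbl) => CallState(sid, n.toInt, lbl)
    case Array(sid, n)      => CallState(sid, n.toInt) // old-format state
    case _                  => sys.error(s"unreadable state: $raw")
  }
}
```

Formats like Protocol Buffers give you this behavior for free, which is one reason schema-aware serialization is worth the up-front cost for long-lived streaming state.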

Assumptions:
1. You know Scala, or want to know more about it.
2. You have deployed Spark to production at your company, or want to.
3. You want to learn some neat tricks that may save you tons of time!

Takeaways:
1. A fully functioning Spark app, with unit tests!

Session hashtag: #DDSAIS16