Erik Erlandson

Software Engineer, Red Hat

Erik Erlandson is a Software Engineer at Red Hat, where he investigates analytics use cases and scalable deployments for Apache Spark in the cloud. He also consults on internal data science and analytics projects. Erik is a contributor to Apache Spark and other open source projects in the Spark ecosystem, including the Spark on Kubernetes community project, Algebird, and Scala.

Past sessions

Summit 2020 User Defined Aggregation in Apache Spark: A Love Story

June 25, 2020 05:00 PM PT

Defining customized scalable aggregation logic is one of Apache Spark's most powerful features. User Defined Aggregate Functions (UDAFs) are a flexible mechanism for extending both Spark data frames and Structured Streaming with new functionality, ranging from specialized summary techniques to building blocks for exploratory data analysis. And yet, as powerful as they are, UDAFs prior to Spark 3.0 had subtle flaws that could undermine both performance and usability.

In this talk Erik will tell the story of how he met UDAFs and fell in love with their powerful features. He'll describe the challenges he faced with the original UDAF design and its performance properties, and how, with the help of the Apache Spark community, he eventually fixed that design in Spark 3.0 and fell in love all over again. Along the way you'll learn how User Defined Aggregation works in Spark, how to write your own UDAF library, and how Spark's newest UDAF features improve both usability and performance. You'll also hear how Spark's code review process made these new features even better, and learn tips for successfully shepherding a large feature into the Apache Spark upstream community.
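To make the new API concrete, here is a minimal sketch of a Spark 3.0-style custom aggregation (the geometric-mean logic is illustrative, not taken from the talk): an Aggregator is registered directly as a UDAF via functions.udaf, which avoids the per-row serialization overhead of the older UserDefinedAggregateFunction API.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions

// Aggregation buffer: running sum of logs plus a sample count.
case class GMBuf(logSum: Double, count: Long)

// A typed Aggregator computing the geometric mean of a Double column.
object GeoMean extends Aggregator[Double, GMBuf, Double] {
  def zero: GMBuf = GMBuf(0.0, 0L)
  def reduce(b: GMBuf, x: Double): GMBuf = GMBuf(b.logSum + math.log(x), b.count + 1)
  def merge(b1: GMBuf, b2: GMBuf): GMBuf = GMBuf(b1.logSum + b2.logSum, b1.count + b2.count)
  def finish(b: GMBuf): Double = if (b.count == 0L) Double.NaN else math.exp(b.logSum / b.count)
  def bufferEncoder: Encoder[GMBuf] = Encoders.product[GMBuf]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
// Spark 3.0: any Aggregator can be registered as a UDAF in one line.
spark.udf.register("geomean", functions.udaf(GeoMean))
spark.range(1, 100).selectExpr("geomean(cast(id as double))").show()
```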

Summit Europe 2018 Apache Spark for Library Developers Part 2

October 2, 2018 05:00 PM PT

As a developer, data engineer, or data scientist, you've seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you're solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark.

You faced a learning curve when you first started using Spark, and you'll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you'll need to turn your code into a library that you can share with the world. We'll cover: issues to consider when developing parallel algorithms with Spark; designing generic, robust functions that operate on data frames and datasets; extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs); best practices around caching and broadcasting, and why these are especially important for library developers; integrating with ML pipelines; exposing key functionality in both Python and Scala; and how to test, build, and publish your library for the community.

We'll back up our advice with concrete examples from real packages built atop Spark. You'll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.
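As a taste of one topic from the list above, here is a minimal sketch of the UDF pattern (the normalize function and column names are invented for the example): keep the logic in a plain, independently testable function and wrap it with udf() only at the API boundary.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Library-friendly style: plain function first, Spark wrapper second.
def normalize(s: String): String = s.trim.toLowerCase
val normalizeUdf = udf(normalize _)

val df = Seq("  Spark ", "SCALA").toDF("raw")
df.select(normalizeUdf($"raw").as("clean")).show()
```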

Session hashtag: #SAISDD6

Summit Europe 2018 Apache Spark for Library Developers Part 1

October 2, 2018 05:00 PM PT

As a developer, data engineer, or data scientist, you've seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you're solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark.

You faced a learning curve when you first started using Spark, and you'll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you'll need to turn your code into a library that you can share with the world. We'll cover: issues to consider when developing parallel algorithms with Spark; designing generic, robust functions that operate on data frames and datasets; extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs); best practices around caching and broadcasting, and why these are especially important for library developers; integrating with ML pipelines; exposing key functionality in both Python and Scala; and how to test, build, and publish your library for the community.

We'll back up our advice with concrete examples from real packages built atop Spark. You'll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.
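To illustrate the caching-and-broadcasting point above, here is a minimal sketch (the lookup table and column names are invented for the example): broadcasting shares one read-only copy of a value per executor instead of shipping it inside every task closure, which matters for library code that carries large shared state.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Broadcast the lookup table once, rather than capturing it in each task.
val countryNames = Map("us" -> "United States", "de" -> "Germany")
val bc = spark.sparkContext.broadcast(countryNames)

val resolve = udf((code: String) => bc.value.getOrElse(code, "unknown"))
Seq("us", "de", "fr").toDF("code")
  .select($"code", resolve($"code").as("name"))
  .show()
```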

Summit Europe 2018 Extending Structured Streaming Made Easy with Algebra

October 2, 2018 05:00 PM PT

Apache Spark's Structured Streaming library provides a powerful set of primitives for building streaming pipelines for data processing. However, it is not always obvious how to take full advantage of this power in a way that works naturally with your application's unique business logic. If you associate algebra with solving equations while wishing you were doing something else, think again: we'll see how we can apply the properties of operations we all understand -- like addition, multiplication, and set union -- to reason about our data engineering pipelines.

Attendees will learn easy techniques for exploiting algebraic patterns in their data processing logic that work seamlessly with Spark's Structured Streaming constructs, effectively extending Spark's native primitives with your customized data processing operations. These simple yet powerful ideas will be illustrated with real world examples.
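One way the algebraic idea plays out in code (a sketch, not an excerpt from the talk): any associative combine operation, here an element-wise (min, max) merge, lifts directly into a Spark Aggregator, and the same aggregator then works in both batch queries and Structured Streaming windows.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// A tiny algebra: (min, max) pairs combine associatively and commutatively,
// just like addition or set union.
case class MinMax(min: Double, max: Double) {
  def combine(that: MinMax): MinMax =
    MinMax(math.min(min, that.min), math.max(max, that.max))
}

object MinMaxAgg extends Aggregator[Double, MinMax, MinMax] {
  def zero: MinMax = MinMax(Double.PositiveInfinity, Double.NegativeInfinity)
  def reduce(b: MinMax, x: Double): MinMax = b.combine(MinMax(x, x))
  def merge(b1: MinMax, b2: MinMax): MinMax = b1.combine(b2) // the algebra does the work
  def finish(b: MinMax): MinMax = b
  def bufferEncoder: Encoder[MinMax] = Encoders.product[MinMax]
  def outputEncoder: Encoder[MinMax] = Encoders.product[MinMax]
}
```

Because merge is associative, Spark is free to pre-aggregate within partitions and within streaming windows; the same pattern extends to sets, sketches, and other monoids.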

Session hashtag: #SAISDev2

Summit 2018 Birds of a Feather Session: Apache Spark on Kubernetes

June 2018

Come learn about Apache Spark's Kubernetes scheduler backend, new in Spark 2.3! Meet project contributors and network with community members interested in running Spark on Kubernetes. Learn about upcoming Spark features for Kubernetes support, and find out how to contribute to the project. Discover new tools in the Spark on Kubernetes ecosystem, and trade tips on how to run Spark jobs on your Kubernetes cluster.

Summit 2018 Apache Spark for Library Developers SAIS 2018

June 4, 2018 05:00 PM PT

As a developer, data engineer, or data scientist, you've seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you're solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark.

You faced a learning curve when you first started using Spark, and you'll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you'll need to turn your code into a library that you can share with the world. We'll cover: issues to consider when developing parallel algorithms with Spark; designing generic, robust functions that operate on data frames and datasets; extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs); best practices around caching and broadcasting, and why these are especially important for library developers; integrating with ML pipelines; exposing key functionality in both Python and Scala; and how to test, build, and publish your library for the community.

We'll back up our advice with concrete examples from real packages built atop Spark. You'll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.
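On the testing point above, a common pattern (sketched here; assumes ScalaTest on the test classpath, and the suite and function names are illustrative) is a local-mode SparkSession exercised against small, deterministic inputs:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class NormalizeSuite extends AnyFunSuite {
  // A local-mode session is enough to exercise most library code paths.
  lazy val spark = SparkSession.builder.master("local[2]").getOrCreate()

  test("normalize trims and lowercases") {
    import spark.implicits._
    val out = Seq("  Spark ").toDS().map(_.trim.toLowerCase).collect()
    assert(out.sameElements(Array("spark")))
  }
}
```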

Session hashtag: #DD9SAIS

Summit East 2017 Sketching Data with T-Digest In Apache Spark

February 8, 2017 04:00 PM PT

Algorithms for sketching probability distributions from large data sets are a fundamental building block of modern data science. Sketching plays a role in diverse applications, ranging from visualization and optimized data encodings to quantile estimation, data synthesis, and imputation. The T-Digest is a versatile sketching data structure: it operates on any numeric data, models tricky distribution tails with high fidelity, and, most crucially, works smoothly with aggregators and map-reduce.
T-Digest is a perfect fit for Apache Spark; it is single-pass, and intermediate results can be aggregated across partitions in batch jobs or across windows in streaming jobs. In this talk I will describe a native Scala implementation of the T-Digest sketching algorithm and demonstrate its use in Spark applications for visualization, quantile estimation, and data synthesis.

Attendees of this talk will leave with an understanding of data sketching with T-Digest sketches, and insights about how to apply T-Digest to their own data analysis applications.
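The single-pass, mergeable structure maps naturally onto Spark's aggregation primitives. Here is a sketch of the pattern (the TDigest interface below is hypothetical shorthand; Erik's isarn-sketches library provides a real implementation): each partition absorbs its samples in one pass, and partition-level sketches are merged associatively.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical minimal t-digest interface; stands in for a real
// implementation such as the one in isarn-sketches.
trait TDigest extends Serializable {
  def update(x: Double): TDigest    // absorb one sample
  def merge(that: TDigest): TDigest // combine two sketches
  def cdfInverse(q: Double): Double // estimate a quantile
}

// Sketch an entire numeric column in a single pass over the data.
def sketchColumn(values: RDD[Double], empty: TDigest): TDigest =
  values.treeAggregate(empty)(
    seqOp = (td, x) => td.update(x),  // per-partition pass
    combOp = (a, b) => a.merge(b))    // associative cross-partition merge
```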

Summit East 2017 Teaching Apache Spark Clusters to Manage Their Workers Elastically

February 7, 2017 04:00 PM PT

Devops engineers have applied a great deal of creativity and energy to invent tools that automate infrastructure management, in the service of deploying capable and functional applications. For data-driven applications running on Apache Spark, the details of instantiating and managing the backing Spark cluster can be a distraction from focusing on the application logic. In the spirit of devops, automating Spark cluster management tasks allows engineers to focus their attention on application code that provides value to end-users.
Using OpenShift Origin as a laboratory, we implemented a platform where Apache Spark applications create their own clusters and then dynamically manage their own scale via host-platform APIs. This makes it possible to launch a fully elastic Spark application with little more than the click of a button.

We will present a live demo of turn-key deployment for elastic Apache Spark applications, and share what we’ve learned about developing Spark applications that manage their own resources dynamically with platform APIs.

The audience for this talk will be anyone looking for ways to streamline their Apache Spark cluster management, reduce the workload for Spark application deployment, or create self-scaling elastic applications. Attendees can expect to learn about leveraging APIs in the Kubernetes ecosystem that enable application deployments to manipulate their own scale elastically.
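The core mechanism can be sketched in a few lines (shown here with the fabric8 kubernetes-client; the deployment name and namespace are illustrative, and the exact client API varies by version): the driver asks the platform to rescale the deployment backing its own workers.

```scala
import io.fabric8.kubernetes.client.DefaultKubernetesClient

// Driver-side elastic scaling: resize the worker deployment via the
// Kubernetes API and let the platform reconcile the actual pods.
def scaleWorkers(namespace: String, replicas: Int): Unit = {
  val client = new DefaultKubernetesClient() // uses in-cluster credentials
  try {
    client.apps().deployments()
      .inNamespace(namespace)
      .withName("spark-workers") // illustrative deployment name
      .scale(replicas)
  } finally client.close()
}
```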

Summit 2017 Smart Scalable Feature Reduction With Random Forests

June 6, 2017 05:00 PM PT

Modern datacenters and IoT networks generate a wide variety of telemetry that makes excellent fodder for machine learning algorithms. Combined with feature extraction and expansion techniques such as word2vec or polynomial expansion, these data yield an embarrassment of riches for learning models and the data scientists who train them. However, these extremely rich feature sets come at a cost. High-dimensional feature spaces almost always include many redundant or noisy dimensions. These low-information features waste space and computation, and reduce the quality of learning models by diluting useful features.
In this talk, Erlandson will describe how Random Forest Clustering identifies useful features in data with many low-quality features, and will demonstrate a feature reduction application using Apache Spark to analyze compute infrastructure telemetry data.

Learn the principles of how Random Forest Clustering solves feature reduction problems, and how you can apply Random Forest tools in Apache Spark to improve your model training scalability, the quality of your models, and your understanding of application domains.
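The random forest clustering technique itself is not built into Spark MLlib, but the end goal, ranking features so that noisy dimensions can be dropped, can be sketched with MLlib's supervised feature importances (a simplified stand-in, not the talk's method; assumes a DataFrame with "features" and "label" columns):

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.sql.DataFrame

// Rank features by random-forest importance; low-scoring dimensions are
// candidates for removal before model training.
def rankFeatures(df: DataFrame): Array[(Int, Double)] = {
  val model = new RandomForestClassifier()
    .setNumTrees(100)
    .setFeaturesCol("features")
    .setLabelCol("label")
    .fit(df)
  model.featureImportances.toArray.zipWithIndex
    .map { case (importance, index) => (index, importance) }
    .sortBy { case (_, importance) => -importance }
}
```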

Session hashtag: #SFds8

Summit Europe 2017 One-Pass Data Science In Apache Spark With Generative T-Digests

October 25, 2017 05:00 PM PT

The T-Digest has earned a reputation as a highly efficient and versatile sketching data structure; however, its applications as a fast generative model are less appreciated. Several common algorithms from machine learning use randomization of feature columns as a building block. Column randomization is an awkward and expensive operation when performed directly, but when implemented with generative T-Digests, it can be accomplished elegantly in a single pass that also parallelizes across Spark data partitions.

In this talk Erik will review the principles of T-Digest sketching, and how T-Digests can be applied as generative models. He will explain how generative T-Digests can be used to implement fast randomization of columnar data, and conclude with demonstrations of T-Digest randomization applied to Variable Importance, Random Forest Clustering and Feature Reduction.

Attendees will leave this talk with an understanding of T-Digest sketching, how T-Digests can be used as generative models, and insights into applying generative T-Digests to accelerate their own data science projects.
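The generative trick rests on inverse-CDF sampling: a t-digest approximates a distribution's CDF, so feeding uniform random quantiles through its inverse yields new samples distributed like the original column. A minimal sketch (cdfInverse stands in for the equivalent method on a real t-digest):

```scala
import scala.util.Random

// Inverse-CDF sampling: uniform quantiles in, distribution-shaped samples out.
def sampleFrom(cdfInverse: Double => Double, n: Int, seed: Long = 42L): Array[Double] = {
  val rng = new Random(seed)
  Array.fill(n) { cdfInverse(rng.nextDouble()) }
}
```

Applied per column and per partition, this is what makes one-pass randomization of feature columns possible.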
Session hashtag: #EUds11

Summit Europe 2017 BoF Discussion-Apache Spark on Kubernetes

October 24, 2017 05:00 PM PT

Come learn about the community development project to add a native Kubernetes scheduling backend to Apache Spark! Network with community members interested in running Spark on Kubernetes. Learn how to run Spark jobs on your Kubernetes cluster, and find out how to contribute to the project.