William Benton leads a team of data scientists and engineers at Red Hat, where he has applied analytic techniques to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His current focus is designing infrastructure for next-generation data-driven applications, but he has also conducted research and development in the areas of static program analysis, managed language runtimes, logic databases, cluster configuration management, and music technology. Benton holds a PhD in computer sciences from the University of Wisconsin.
As a developer, data engineer, or data scientist, you've seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you're solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark. You faced a learning curve when you first started using Spark, and you'll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you'll need to turn your code into a library that you can share with the world. We'll cover: Issues to consider when developing parallel algorithms with Spark, Designing generic, robust functions that operate on data frames and datasets, Extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs), Best practices around caching and broadcasting, and why these are especially important for library developers, Integrating with ML pipelines, Exposing key functionality in both Python and Scala, and How to test, build, and publish your library for the community. We'll back up our advice with concrete examples from real packages built atop Spark. You'll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.
There are lots of reasons why you might want to implement your own machine learning algorithms on Spark: you might want to experiment with a new idea, try and reproduce results from a recent research paper, or simply to use an existing technique that isn't implemented in MLlib. In this talk, we'll walk through the process of developing a new machine learning algorithm for Spark. We'll start with the basics, by considering how we'd design a scale-out parallel implementation of our unsupervised learning technique. The bulk of the talk will focus on the details you need to know to turn an algorithm design into an efficient parallel implementation on Spark: we'll start by reviewing a simple RDD-based implementation, show some improvements, point out some pitfalls to avoid, and iteratively extend our implementation to support contemporary Spark features like ML Pipelines and structured query processing. We'll conclude by briefly examining some useful techniques to complement scale-out performance by scaling our code up, taking advantage of specialized hardware to accelerate single-worker performance. You'll leave this talk with everything you need to build a new machine learning technique that runs on Spark.
High-quality downstream distributions of open-source projects benefit everyone. End-users enjoy convenient installation and upgrades, dependency management, system integration, and the fruits of a thriving testing and support community. Downstream packagers contribute testing and fixes to upstream developers and free up core teams to focus on enhancements and fixes rather than on the details of packaging. In this talk, we’ll discuss these benefits and present our efforts — along with the Fedora Big Data SIG — to package Spark for Fedora. We’ll cover some of the unique challenges presented by the impedance mismatch between traditional downstream packaging models and the Scala and big data ecosystems, present our current progress, and discuss opportunities for other members of the community to get involved.
Spark’s support for efficient execution and rapid interactive prototyping enable novel approaches to understanding data-rich domains that have historically been underserved by analytical techniques. One such field is endurance sports, where athletes are faced with GPS and elevation traces as well as samples from heart rate, cadence, temperature, and wattage sensors. These data streams can be somewhat comprehensible at any given moment, when looking at a small window of samples on one’s watch or cycle computer, but are overwhelming in the aggregate. In this talk, I’ll present my recent efforts using Spark and MLLib to mine my personal cycling training data for deeper insights and help me design workouts to meet particular fitness goals. This work incorporates analysis of geographic and time-series data, computational geometry, visualization, and domain knowledge of exercise physiology. I’ll show how Spark made this work possible, demonstrate some novel techniques for analyzing fitness data, and discuss how these approaches could be applied to make sense of data from an entire community of cyclists.
Successful companies use analytic measures to identify and reward their best projects and contributors. Successful open source developers often make similar decisions when they evaluate whether or not to reward a project or community by investing their time. This talk will show how Spark enables a data-driven understanding of the dynamics of open source communities, using operational data from the Fedora Project as an example. With thousands of contributors and millions of users, Fedora is one of the world’s largest open-source communities. Notably, Fedora also has completely open infrastructure: every event related to the project’s daily operation is logged to a public messaging bus, and historical event data are available in bulk. We’ll demonstrate best practices for using Spark SQL to ingest bulk data with rich, nested structure, using ML pipelines to make sense of software community data, and keeping insights current by processing streaming updates.
Contemporary applications and infrastructure software leave behind a tremendous volume of metric and log data. This aggregated "digital exhaust" is inscrutable to humans and difficult for computers to analyze, since it is vast, complex, and not explicitly structured. This session will introduce the log processing domain and provide practical advice for analyzing log data with Apache Spark, including: * how to impose a uniform structure on disparate log sources; * machine-learning techniques to detect infrastructure failures automatically and characterize the text of log messages; * best practices for tuning Spark, training models against structured data, and ingesting data from external sources like ElasticSearch; and * a few relatively painless ways to visualize your results. You'll have a better understanding of the unique challenges posed by infrastructure log data after this session. You'll also learn the most important lessons from our efforts both to develop analytic capabilities for an open-source log aggregation service and to evaluate these at enterprise scale.
Consider two recent trends in application development: more and more applications are taking advantage of architectures involving containerized microservices in order to enable improved elasticity, fault-tolerance, and scalability -- whether in the public cloud or on-premise. In addition, analytic capabilities and scalable data processing have increasingly become a basic requirement for contemporary applications. The confluence of these trends suggests that there are a lot of good reasons to want to manage Spark with a container orchestration platform, but it's not quite as simple as packaging up a standalone cluster in containers. This talk will present our team's experiences migrating a production Spark cluster from a multi-tenant Mesos cluster to a shared compute resource managed by Kubernetes. We'll explain the motivation behind microservices and containers and identify the architectures that make sense for containerized applications that depend on Spark. We'll pay special attention to practical concerns of running Spark in containers, including networking, access control, persistent storage, and multitenancy. You'll leave this talk with a better understanding of why you might want to run Spark in containers and some concrete ideas for how to get started doing it.
Developers love Linux containers, which neatly package up an application and its dependencies and are easy to create and share. However, this unbeatable developer experience hides some deployment challenges for real applications: how do you wire together pieces of a multi-container application? Where do you store your persistent data if your containers are ephemeral? Do containers really contain and isolate your application, or are they merely hiding potential security vulnerabilities? Are your containers scheduled across your compute resources efficiently, or are they trampling on one another? Container application platforms like Kubernetes provide the answers to some of these questions. We’ll draw on expertise in Linux security, distributed scheduling, and the Java Virtual Machine to dig deep on the performance and security implications of running in containers. This talk will provide a deep dive into tuning and orchestrating containerized Spark applications. You’ll leave this talk with an understanding of the relevant issues, best practices for containerizing data-processing workloads, and tips for taking advantage of the latest features and fixes in Linux Containers, the JDK, and Kubernetes. You’ll leave inspired and enabled to deploy high-performance Spark applications without giving up the security you need or the developer-friendly workflow you want.
There are lots of reasons why you might want to implement your own machine learning algorithms on Spark: you might want to experiment with a new idea, try and reproduce results from a recent research paper, or simply to use an existing technique that isn't implemented in MLlib. In this talk, we'll walk through the process of developing a new machine learning model for Spark. We'll start with the basics, by considering how we'd design a parallel implementation of a particular unsupervised learning technique. The bulk of the talk will focus on the details you need to know to turn an algorithm design into an efficient parallel implementation on Spark: we'll start by reviewing a simple RDD-based implementation, show some improvements, point out some pitfalls to avoid, and iteratively extend our implementation to support contemporary Spark features like ML Pipelines and structured query processing. You'll leave this talk with everything you need to build a new machine learning technique that runs on Spark. Session hashtag: #EUds5