As a developer, data engineer, or data scientist, you’ve seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you’re solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark.
You faced a learning curve when you first started using Spark, and you’ll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you’ll need to turn your code into a library that you can share with the world. We’ll cover: Issues to consider when developing parallel algorithms with Spark, Designing generic, robust functions that operate on data frames and datasets, Extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs), Best practices around caching and broadcasting, and why these are especially important for library developers, Integrating with ML pipelines, Exposing key functionality in both Python and Scala, and How to test, build, and publish your library for the community.
We’ll back up our advice with concrete examples from real packages built atop Spark. You’ll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.
Session hashtag: #SAISDD6
Erik Erlandson is a Software Engineer at Red Hat, where he investigates analytics use cases and scalable deployments for Apache Spark in the cloud. He also consults on internal data science and analytics projects. Erik is a contributor to Apache Spark and other open source projects in the Spark ecosystem, including the Spark on Kubernetes community project, Algebird and Scala..
William Benton is passionate about making it easier for machine learning practitioners to benefit from advanced infrastructure and making it possible for organizations to manage machine learning systems. His recent roles have included defining product strategy and professional services offerings related to data science and machine learning, leading teams of data scientists and engineers, and contributing to many open source communities related to data, ML, and distributed systems. Will was an early advocate of building machine learning systems on Kubernetes and developed and popularized the “intelligent applications” idiom for machine learning systems in the cloud. He has also conducted research and development related to static program analysis, language runtimes, cluster configuration management, and music technology.