Holden Karau is a transgender Canadian, Apache Spark committer, and co-author of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden talks internationally on Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and Machine Learning. Prior to IBM, she worked on a variety of distributed and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science.
As big data jobs move from the proof-of-concept phase into powering real production services, we have to start considering what will happen when everything eventually goes wrong (such as recommending inappropriate products or making other decisions based on bad data). This talk will attempt to convince you that we will all eventually get aboard the failboat (especially with ~40% of respondents automatically deploying their Spark jobs' results to production), and that it's important to automatically recognize when things have gone wrong so we can stop deployment before we have to update our resumes. Figuring out when things have gone terribly wrong is trickier than it first appears, since we want to catch the errors before our users notice them (or, failing that, before CNN notices them). We will explore general techniques for validation, look at responses from people validating big data jobs in production environments, and examine libraries that can assist us in writing relative validation rules based on historical data. For folks working in streaming, we will talk about the unique challenges of attempting to validate in a real-time system, and what we can do besides keeping an up-to-date resume on file for when things go wrong. To keep the talk interesting, real-world examples (with company names removed) will be presented, as well as several Creative Commons licensed cat pictures and an adorable panda GIF. If you've seen Holden's previous testing Spark talks, this can be viewed as a deep dive on the second half, focused on what else we need to do besides good testing practices to create production-quality pipelines. If you haven't seen the testing talks, watch those on YouTube after you come see this one :)
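As a rough illustration of a "relative validation rule based on historical data", here is a minimal sketch in plain Python. It is not taken from the talk or from any particular library: the function names, the counter names, and the three-standard-deviations threshold are all assumptions chosen for the example. The idea is to compare a job's output counters against the same counters from previous runs and to block deployment when a value falls far outside its historical range.

```python
from statistics import mean, stdev


def within_historical_range(current, history, num_stdevs=3.0, min_runs=5):
    """Return True if `current` is within `num_stdevs` standard deviations
    of the mean of `history` (hypothetical rule, thresholds are assumptions)."""
    if len(history) < min_runs:
        # Too little history to judge; this sketch lets the value through.
        return True
    mu = mean(history)
    sigma = stdev(history)
    return abs(current - mu) <= num_stdevs * sigma


def validate_job(counters, history_by_counter):
    """Return the names of counters that fail the relative rule."""
    failures = []
    for name, value in counters.items():
        history = history_by_counter.get(name, [])
        if not within_historical_range(value, history):
            failures.append(name)
    return failures


# Counters from earlier runs of the same (hypothetical) job.
history = {
    "records_written": [1000, 1020, 980, 1010, 990],
    "parse_errors": [3, 5, 4, 6, 2],
}

# An empty list means the run looks normal; anything else should stop deployment.
print(validate_job({"records_written": 1005, "parse_errors": 4}, history))
print(validate_job({"records_written": 120, "parse_errors": 4}, history))
```

In a real pipeline the counters might come from Spark accumulators or job metrics, and the history from a small metadata store; both of those choices are outside what this sketch covers.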
If you're subscribed to firstname.lastname@example.org, or work in a large company, you may see some common Spark error messages. Even just attending Spark Summit over the past few years, you may have seen talks like the "Top K Mistakes in Spark." Cool tools that are not based on machine learning do exist for examining Spark's logs, but because they don't use machine learning they are not as cool, and they are limited by the amount of effort humans can put into writing rules for them. This talk will look at what happens when we train "regular" clustering models on stack traces, and explore DL models for classifying user messages to the Spark list. Come for the reassurance that the robots are not yet able to fix themselves, and stay to learn how to work better with the help of our robot friends. The tl;dr of this talk is that Spark ML on Spark output, plus a little bit of TensorFlow, is fun for the whole family, but probably shouldn't automatically respond to user list posts just yet.
Apache Arrow support is new in Spark 2.3, and offers faster interchange between Spark and Python. Apache Arrow also has connections to TensorFlow (and even without those can be fed from Pandas). This talk will look at how to use Arrow to accelerate data copy from Spark to TensorFlow, and how to expose basic functionality in Scala for working with TensorFlow. From there we will dive into how to construct new Deep Learning ML pipeline stages in Python and make them available to be used by our friends in Scala land. Session hashtag: #DL7SAIS
PySpark is getting awesomer in Spark 2.3 with vectorized UDFs, and there are even more wonderful things on the horizon (and currently available as WIP packages). This talk will start by illustrating how to use PySpark's new vectorized UDFs to make ML pipeline stages. Since most of us use Python in part because of its wonderful libraries, like pandas, numpy, and antigravity*, it's important to be able to make sure that our dependencies are available on our cluster. Historically there's been a few If there is time near the end, we will talk about how to expose your Python code to Scala so everyone can use your fancy deep learning code (if you want them to). *Ok, maybe not a real thing, but insert the super-specialized, domain-specific library you use instead :) Session hashtag: #Py4SAIS
Everyone who has maintained a search cluster knows the pain of keeping our online update code and offline reindexing pipelines in sync. Subtle bugs can pop up when our data is indexed differently depending on the context. By using Spark and Spark Streaming we can reuse the same indexing code between contexts and even reduce overhead by talking directly to the correct indexing node. Sometimes we need to use search data as part of our distributed map-reduce jobs. We will illustrate how to use Elasticsearch as a side data source with Spark. We will also illustrate both of these tasks in two real examples using the Twitter firehose. In the first we will index tweets in a geospatial context, and in the second we will use the same index to determine the top hashtags per region.
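The core idea of reusing one indexing function across the online and offline paths can be sketched in plain Python. This is only an illustration under stated assumptions: the tweet fields (`id`, `text`, `lat`, `lon`), the document layout, and all function names are hypothetical, no Elasticsearch client is involved, and the hashtag parsing is deliberately simplistic. In Spark, the same function would typically be passed to `map` on an RDD for reindexing and on a DStream for live updates.

```python
from collections import Counter, defaultdict


def tweet_to_doc(tweet):
    """Single place that decides how a tweet becomes an index document.
    Field names here are assumptions made for the example."""
    hashtags = [
        word[1:].lower()
        for word in tweet["text"].split()
        if word.startswith("#") and len(word) > 1
    ]
    return {
        "id": tweet["id"],
        "text": tweet["text"],
        "location": {"lat": tweet["lat"], "lon": tweet["lon"]},
        "hashtags": hashtags,
    }


def index_batch(tweets):
    """Offline reindexing path: convert a whole collection."""
    return [tweet_to_doc(t) for t in tweets]


def index_one(tweet):
    """Online update path: convert a single incoming tweet."""
    return tweet_to_doc(tweet)


def top_hashtags_per_region(docs, region_of, k=2):
    """Count hashtags per region; `region_of` maps a location to a region name."""
    counts = defaultdict(Counter)
    for doc in docs:
        counts[region_of(doc["location"])].update(doc["hashtags"])
    return {
        region: [tag for tag, _ in counter.most_common(k)]
        for region, counter in counts.items()
    }


tweets = [
    {"id": 1, "text": "#Spark is fun #bigdata", "lat": 37.0, "lon": -122.0},
    {"id": 2, "text": "more #spark", "lat": 40.0, "lon": -74.0},
    {"id": 3, "text": "#pandas", "lat": -33.0, "lon": 151.0},
]


def hemisphere(location):
    # Toy region function: split the world by the equator.
    return "north" if location["lat"] >= 0 else "south"


docs = index_batch(tweets)
print(top_hashtags_per_region(docs, hemisphere))
```

Because both paths call `tweet_to_doc`, a document indexed from the live stream and one produced by a full reindex cannot drift apart, which is the class of subtle bug the abstract describes.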