This talk will take two existing Spark ML pipelines (Frank The Unicorn, a Scala pipeline for predicting PR comments, https://github.com/franktheunicorn/predict-pr-comments, and a Python Spark ML pipeline on Spark errors) and explore the steps involved in migrating them to a combination of Spark and TensorFlow. Using the open source Kubeflow project (which has Spark support as of 0.5), we will create two integrated end-to-end pipelines to explore the challenges involved and look at areas for improvement (e.g. Apache Arrow).
Holden is an Apache Spark committer and PMC member who focuses on PySpark and Kubernetes support. She is the co-author of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Her current side project is a book to teach children distributed systems, http://www.distributedcomputing4kids.com/.