Bryan Cutler is a software engineer at IBM’s Spark Technology Center, where he works on big data analytics and machine learning systems. He is a contributor to Apache Spark in the areas of ML, SQL, Core and Python, and a committer for the Apache Arrow project. His interests are in pushing the boundaries of software to build high-performance tools that are also a snap to use.
In the big data world, it's not always easy for Python users to move huge amounts of data around. Apache Arrow defines a common format for data interchange, while Arrow Flight, introduced in version 0.11.0, provides a means to move that data efficiently between systems. Arrow Flight is a framework for Arrow-based messaging built with gRPC. It enables data microservices in which clients produce and consume streams of Arrow data and share them over the wire. In this session, I'll give a brief overview of Arrow Flight from a Python perspective and show that it's easy to build high-performance connections when systems can talk Arrow. I'll also cover some ongoing work on using Arrow Flight to connect PySpark with TensorFlow - two systems with great Python APIs but very different internal data formats.
Spark 2.3.0 set a great foundation for using Apache Arrow to increase Python performance and interoperability with Pandas. Come by and share your use cases to see if using Arrow could work to improve your Spark jobs. Discuss possible next steps for leveraging Arrow in Spark, and how it would jumpstart Machine Learning and Deep Learning workloads.
Tuning a Spark ML model with cross-validation can be an extremely computationally expensive process. As the number of hyperparameter combinations increases, so does the number of models being evaluated. The default configuration in Spark is to evaluate these models one by one and select the best-performing one. When running this process with a large number of models, if training and evaluating a single model does not fully utilize the available cluster resources, that waste is compounded for each model and leads to long run times. Model parallelism in Spark cross-validation, available from Spark 2.3, allows more than one model to be trained and evaluated at the same time and makes better use of cluster resources. We will go over how to enable this setting in Spark, what effect it has on an example ML pipeline, and best practices to keep in mind when using this feature. Additionally, we will discuss ongoing work to reduce the amount of computation required when tuning ML pipelines by eliminating redundant transformations and intelligently caching intermediate datasets. This can be combined with model parallelism to further reduce the run time of cross-validation for complex machine learning pipelines. Session hashtag: #DS6SAIS
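The idea behind model parallelism can be illustrated outside Spark with a minimal sketch in plain Python. It is not Spark's implementation: the hyperparameter grid, the `train_and_evaluate` scoring formula, and the `tune` helper are all hypothetical stand-ins, and a thread pool simply lets several "models" be scored at once instead of one by one. In Spark itself, the corresponding setting is the `parallelism` parameter of `CrossValidator`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical hyperparameter grid; names and values are illustrative only.
PARAM_GRID = {"reg_param": [0.01, 0.1, 1.0], "max_iter": [10, 50]}


def param_combinations(grid):
    """Expand a dict of lists into a list with one dict per combination."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def train_and_evaluate(params):
    """Stand-in for fitting and scoring one model (assumption: higher is better).

    A real version would fit a model on the training folds and return a
    validation metric; a made-up formula keeps this sketch runnable.
    """
    return -abs(params["reg_param"] - 0.1) - abs(params["max_iter"] - 50) / 100.0


def tune(grid, parallelism=1):
    """Score every combination, running up to `parallelism` of them at a time."""
    combos = param_combinations(grid)
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map() preserves input order, so scores line up with combos.
        scores = list(pool.map(train_and_evaluate, combos))
    best_index = max(range(len(combos)), key=scores.__getitem__)
    return combos[best_index], scores[best_index]


if __name__ == "__main__":
    best_params, best_score = tune(PARAM_GRID, parallelism=4)
    print(best_params, best_score)
```

With `parallelism=1` the sketch behaves like the one-by-one default described above; raising it changes only how many evaluations are in flight at once, not which combination is selected.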
PySpark is getting awesomer in Spark 2.3 with vectorized UDFs, and there are even more wonderful things on the horizon (and currently available as WIP packages). This talk will start by illustrating how to use PySpark's new vectorized UDFs to make ML pipeline stages. Since most of us use Python in part because of its wonderful libraries, like pandas, numpy, and antigravity*, it's important to be able to make sure that our dependencies are available on our cluster. Historically there's been a few ... If there is time near the end, we will talk about how to expose your Python code to Scala so everyone can use your fancy deep learning code (if you want them to). *Ok maybe not a real thing, but insert super specialized domain specific library you use instead :) Session hashtag: #Py4SAIS