Ryan Murray has been a Principal Consulting Engineer in the professional services organization at Dremio since July 2019. He previously worked in the financial services industry, in roles ranging from bond trader to data engineering lead. Ryan holds a PhD in Theoretical Physics and is an active open source contributor who dislikes it when data isn’t accessible in an organisation. He is passionate about making customers successful and self-sufficient, and he still dreams of one day winning the Stanley Cup.
Machine learning pipelines are a hot topic at the moment. Moving data through a pipeline efficiently and predictably is one of the most important aspects of running machine learning models in production. In this talk, we'll break down the modern machine learning pipeline and demonstrate how it can be improved with a modern transport mechanism. First, we'll introduce Apache Arrow and Arrow Flight, reviewing the motivation, architecture and key features of the Arrow Flight protocol with an example of a simple Flight server and client. Second, we'll introduce an Arrow Flight Spark datasource, examine its key features and show how one can build microservices for and with Spark. We'll also look at benchmarks and the benefits of Flight compared with other common transport protocols. Finally, we'll show a demo of a toy machine learning pipeline running in Spark with data microservices powered by Arrow Flight, highlighting how much faster and simpler the Flight interface makes this example pipeline. The audience will leave this session with an understanding of how Apache Arrow Flight can enable more efficient machine learning pipelines in Spark.