Extending Spark Machine Learning: Adding Your Own Algorithms and Tools

Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. This talk introduces Spark’s ML pipelines and then looks at how to extend them with your own custom algorithms. By integrating your own data preparation and machine learning tools into Spark’s ML pipelines, you will be able to take advantage of useful meta-algorithms, such as parameter search and pipeline persistence (with a bit more work, of course).
Even if you don’t have machine learning algorithms of your own to implement, this session will give you an inside look at how the ML APIs are built. It will also help you build better ML pipelines, customize Spark models for your needs, and develop a stronger background for future Spark ML projects.

The examples in this talk will be presented in Scala, but any non-standard syntax will be explained.

Session hashtag: #SFml2

About Seth Hendrickson

Seth Hendrickson is a top Apache Spark contributor. He implemented multinomial logistic regression with elastic-net regularization in Spark's ML library and has contributed several other performance improvements to linear models in Spark. He has also made extensive contributions to Spark ML decision trees and ensemble algorithms. Prior to joining IBM, Seth was an electrical engineer working on signal processing and IoT. He earned his M.S. in electrical engineering from the Georgia Institute of Technology.