Jeff Smith builds large-scale machine learning systems using Scala and Spark. For the past decade, he has been working on data science applications at various startups in New York, San Francisco, and Hong Kong. He’s a frequent blogger and the author of “Reactive Machine Learning Systems,” an upcoming book from Manning on how to build real-world machine learning systems using Scala, Akka, and Spark.
Generating features is one of the most important and least discussed aspects of building machine learning systems. Because it's easy to get started, data teams often wind up with a collection of features that have all sorts of problems: some are inaccurate, some repeat work, some are unreliable, some are unnecessary, and the whole pipeline is too slow and opaque. In this talk, I'll show you how features don't have to be such a pain by using techniques from the reactive approach to machine learning.
When building reactive machine learning systems, we try to hold our complicated, large-scale machine learning systems to the same standard as modern web and mobile apps. Using the Reactive Manifesto as our guide, we can see the traits we want our machine learning system to have: responsiveness, elasticity, resilience, and so on. This talk focuses on how to achieve those traits for feature generation pipelines specifically. We'll start by building on top of MLlib's feature generation capabilities and the broader capabilities of Spark, but we'll go well beyond the example code in the programming guide. Key points along the way will include:
• Structuring feature transforms
• Supervising feature pipelines
• Operating on collections of features
• Techniques for validating features
You don't have to be frustrated by your feature generation code! Using the power of Spark and the principles of reactive machine learning, you too can have awesome feature generation capabilities that help you achieve your data science goals.
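To make the key points above a little more concrete, here is a minimal sketch in plain Scala of what a structured, supervised, and validated feature transform might look like. It is not code from the talk and it does not use MLlib; every name in it (RawEvent, Feature, FeatureTransform, WordCount, MeanWordLength, validateRange) is a hypothetical chosen for illustration. The assumption is that a transform is a named function from a raw record to a typed feature, that failures are captured per record with scala.util.Try, and that validation is a separate step returning Either.

```scala
import scala.util.Try

object FeatureSketch {
  // Hypothetical raw input record; the field names are illustrative only.
  final case class RawEvent(userId: String, text: String)

  // A named, typed feature value.
  final case class Feature[A](name: String, value: A)

  // Structuring: a transform is a named function from raw data to a feature.
  trait FeatureTransform[A] {
    def name: String
    def extract(event: RawEvent): A

    // Supervising (in a very small sense): wrap extraction in Try so that
    // one bad record yields a Failure instead of crashing the pipeline.
    def apply(event: RawEvent): Try[Feature[A]] =
      Try(Feature(name, extract(event)))
  }

  object WordCount extends FeatureTransform[Int] {
    val name = "word_count"
    def extract(event: RawEvent): Int =
      event.text.split("\\s+").count(_.nonEmpty)
  }

  object MeanWordLength extends FeatureTransform[Double] {
    val name = "mean_word_length"
    def extract(event: RawEvent): Double = {
      val words = event.text.split("\\s+").filter(_.nonEmpty)
      // Throws IllegalArgumentException for empty text; apply() captures it.
      require(words.nonEmpty, "no words in text")
      words.map(_.length).sum.toDouble / words.length
    }
  }

  // Validating: check a numeric feature against an expected range.
  def validateRange(
      feature: Feature[Double],
      min: Double,
      max: Double
  ): Either[String, Feature[Double]] =
    if (feature.value >= min && feature.value <= max) Right(feature)
    else Left(s"${feature.name}=${feature.value} is outside [$min, $max]")
}
```

Under these assumptions, FeatureSketch.WordCount(event) returns a Success holding the feature, while FeatureSketch.MeanWordLength applied to an event with empty text returns a Failure that the caller can log, skip, or retry. In a Spark job the same transforms could be applied inside a map over a distributed collection, though that wiring is left out here.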
Something really exciting and largely unnoticed is going on in the Spark ecosystem. As data scientists and engineers learn Spark, they’re implicitly learning a much older, more general topic: typed functional programming. While Spark itself was built on an accumulation of powerful computer science concepts from functional programming and other areas, developers often encounter these ideas for the first time in the context of Spark. It turns out that Spark makes an excellent platform for learning concepts like immutability, higher-order and anonymous functions, laziness, and monadic operators.
This talk will discuss how Spark can be used as a teaching tool to build skills in areas like typed functional programming. We’ll explore a skill-building curriculum that can be used with a data scientist or engineer who only has experience in imperative, dynamically typed languages like Python. This curriculum introduces the core concepts of functional programming and type theory, while giving learners the opportunity to immediately apply their skills at massive scale, using Spark’s painless scalability and resilience. Based on the experience of building machine learning teams at x.ai and other data-centric startups, this curriculum is the foundation for building poly-skilled, highly autonomous team members who can build scalable intelligent systems.
We’ll work from foundational concepts of Scala and functional programming towards a fully implemented machine learning pipeline, all using Spark and MLlib. Unique new features of Spark like Datasets and Structured Streaming will be particularly useful in this effort. Using this approach, teams can help members in all roles learn how to use sophisticated programming techniques that ensure correctness at scale. With these skills in their toolbox, data scientists and engineers often find that building powerful machine learning systems is intuitive, easy, and even fun.
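As a rough illustration of the concepts this abstract names (immutability, higher-order and anonymous functions, laziness, and monadic operators), here is a small sketch in plain Scala with no Spark dependency. It is not taken from the curriculum itself; the Reading type and all function names are hypothetical, and the laziness example assumes Scala 2.13 or later, where LazyList is available.

```scala
object FunctionalConcepts {
  // Immutability: case classes and List are immutable, so "updates"
  // produce new values and leave the originals untouched.
  final case class Reading(sensor: String, value: Double)

  val readings: List[Reading] = List(
    Reading("a", 1.0),
    Reading("b", -2.0),
    Reading("c", 3.0)
  )

  // Higher-order function: it takes another function as a parameter.
  def transformValues(rs: List[Reading], f: Double => Double): List[Reading] =
    rs.map(r => r.copy(value = f(r.value)))

  // Anonymous function: `v => v * 2` is defined inline at the call site.
  val doubled: List[Reading] = transformValues(readings, v => v * 2)

  // Laziness: the LazyList below is conceptually infinite, so this only
  // terminates because elements are computed on demand.
  val firstSquares: List[Int] =
    LazyList.from(1).map(n => n * n).take(3).toList

  // Monadic operators: map and flatMap chain steps that may produce no
  // result, without explicit null checks.
  def positive(r: Reading): Option[Double] =
    if (r.value > 0) Some(r.value) else None

  def logOf(r: Reading): Option[Double] =
    positive(r).map(v => math.log(v))

  // flatMap drops the readings for which logOf returns None.
  val logs: List[Double] = readings.flatMap(r => logOf(r))
}
```

The same shapes carry over to Spark: transformations such as map, filter, and flatMap take functions as arguments, and they are evaluated lazily until an action requests a result.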