Anna Holschuh is a Lead Data Engineer for Target HQ in the Enterprise Data Analytics and Business Intelligence team. She has combined her love of all things Target with building scalable, high-throughput systems with an emphasis on Machine Learning. At Target, Anna is currently building Spark production pipelines that help bring the best mix of products to Target guests all over the country. She completed her S.B. and M.Eng. in EECS at MIT, with a focus on Machine Learning for her graduate work. Anna hails from the Twin Cities in Minnesota.
If you enjoy Linear Algebra, Spark, and exceptionally bad puns, then this could be the talk for you! In this session, we will chronicle the adventures of developing a large-scale Spark system in Scala at Target to power a text-based similarity engine by using core Linear Algebra concepts. You will not hear about a shiny system and how awesome it is; instead, you will learn about everything that went wrong and all of the lessons that were learned along the way. We will cover concepts such as Cosine Similarity, Spark's Distributed Matrix APIs, and the Breeze numerical processing library that powers these APIs under the hood, among other things. We will embark on this system development journey together to understand what it took from beginning to end to pull a performant and scalable similarity engine together. Linear Algebra is often the backbone of prominent machine learning algorithms, and the goal is that from this session, you will gain a deeper understanding of what gotchas exist and what is needed to design, tune, and scale these types of systems.
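For readers unfamiliar with Cosine Similarity, here is a minimal sketch of the measure in plain Scala. It is not the system described in the talk: it uses no Spark or Breeze APIs, and the names `CosineSketch` and `cosine` are hypothetical, chosen only for illustration. It also assumes dense vectors of equal length and treats a zero vector as having similarity 0.

```scala
object CosineSketch {
  /** Cosine similarity: dot(a, b) / (||a|| * ||b||) for equal-length dense vectors. */
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    // Assumption for this sketch: a zero vector is similar to nothing.
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }
}
```

Vectors pointing in the same direction score 1, and orthogonal vectors score 0. In a text-based setting, each vector would typically hold term weights for one document, though how the talk builds its vectors is not stated here.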
Out of the box, Spark provides rich and extensive APIs for performing in-memory, large-scale computation across data. Once a system has been built and tuned with Spark Datasets/DataFrames/RDDs, have you ever been left wondering if you could push the limits of Spark even further? In this session, we will cover some of the tips learned while building retail-scale systems at Target to maximize the parallelization that you can achieve from Spark in ways that may not be obvious from current documentation. Specifically, we will cover multithreading the Spark driver with Scala Futures to enable parallel job submission. We will talk about developing custom partitioners to leverage the ability to apply operations across understood chunks of data, and what tradeoffs that entails. We will also dive into strategies for parallelizing scripts with Spark that might have nothing to do with Spark, to support environments where peers work in multiple languages or where a different language/library is simply the best tool for the job. Come learn how to squeeze every last drop out of your Spark job with strategies for parallelization that go off the beaten path.
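The idea of submitting jobs in parallel from the driver with Scala Futures can be sketched roughly as follows. This is an illustration under stated assumptions, not the talk's code: `runJob` is a hypothetical stand-in for any blocking Spark action (for example a count or a write), and the names `ParallelSubmitSketch` and `runAll`, the pool size, and the timeout are all invented for the example.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object ParallelSubmitSketch {
  // Hypothetical stand-in for a blocking Spark action; here it just
  // returns the length of the job name so the sketch runs without Spark.
  def runJob(name: String): Int = name.length

  /** Runs one job per name on a fixed-size driver-side thread pool. */
  def runAll(names: Seq[String], threads: Int): Seq[Int] = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      // Each Future submits its job from a separate driver thread.
      val futures = names.map(n => Future(runJob(n)))
      // Future.sequence keeps results in the same order as the inputs.
      Await.result(Future.sequence(futures), 1.minute)
    } finally pool.shutdown()
  }
}
```

A dedicated fixed-size pool is used here instead of the global execution context so that the number of concurrently submitted jobs is explicit and bounded; whether the talk makes the same choice is not stated in the abstract.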
The more time you spend developing within a framework such as Apache Spark, the more you find additional features that would be helpful to have, given the context and details of your specific use case. Spark supports a very concise and readable coding style using functional programming paradigms. Wouldn’t it be awesome to add your own functions into the mix using the same style? Well, you can! In this session, you will learn about using Scala’s “Enrich my library” programming pattern to add new functionality to Spark’s APIs. We will dive into a how-to guide with code snippets and present an example where this strategy was used to develop a validation framework for Spark Datasets in a production pipeline. Come learn how to enrich your Spark! Session hashtag: #DevSAIS19
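As a rough illustration of the “Enrich my library” pattern mentioned above, the sketch below adds a validation method to an existing type through an implicit class. To stay runnable without Spark, it enriches a plain Scala `Seq` and not a Spark Dataset; the names `EnrichSketch`, `ValidationOps`, and `validateAll` are hypothetical and are not taken from the talk or from any Spark API.

```scala
object EnrichSketch {
  // An implicit class wraps an existing type and makes its methods
  // available as if they were defined on that type.
  implicit class ValidationOps[A](private val rows: Seq[A]) extends AnyVal {
    /** Returns the rows unchanged if every row satisfies `rule`; fails otherwise. */
    def validateAll(rule: A => Boolean, message: String): Seq[A] = {
      val bad = rows.filterNot(rule)
      require(bad.isEmpty, s"$message (${bad.size} offending rows)")
      rows
    }
  }
}
```

With `import EnrichSketch._` in scope, the new method chains like a built-in one, for example `Seq(1, 2, 3).validateAll(_ > 0, "values must be positive").map(_ * 2)`. The same shape could presumably wrap a Dataset, with the wrapped type and the failure behavior adapted to the pipeline.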