Anna Holschuh

Lead Data Scientist / Engineer, Independent

Anna Holschuh is a Lead Data Engineer for Target HQ on the Enterprise Data Analytics and Business Intelligence team. She has combined her love of all things Target with building scalable, high-throughput systems that emphasize machine learning. At Target, Anna is currently building Spark production pipelines that help bring the best mix of products to Target guests all over the country. She completed her S.B. and M.Eng. in EECS at MIT, focusing on machine learning in her graduate work. Anna hails from the Twin Cities in Minnesota.

UPCOMING SESSIONS

Oh Hell: Adventures in Scaling Reinforcement Learning Concepts in Spark to Learn Optimal Strategy For A Family Favorite Card Game - Summit 2020

Oh Hell is a trick-taking card game where the goal is to take the exact number of tricks specified by a bid before a hand is played. The game plays a bit like the more familiar games of Hearts or Bridge. It is a personal family favorite and often brings out fierce competition and much debate over the optimal bidding and playing strategy. It quickly became clear that to stay competitive on the family Oh Hell circuit, it would be necessary to build out an AI with Spark to learn optimal strategy. In this session, we will chronicle the end-to-end journey of leveraging Spark to build out a large-scale system capable of learning optimal card strategy for this multiagent, stochastic game with imperfect information.

We will cover the mechanics and math of the Oh Hell card game, how to represent and model this game as a Markov Decision Process, how to generate massive amounts of game play data with Monte Carlo simulation techniques using Spark, how Reinforcement Learning concepts can be applied, and finally how to leverage Spark to scale out both the massive state space and the computation needed to learn strategy. If recent advancements in game-playing AIs have excited you and you haven't been sure where to begin, this talk is intended to be a self-contained introduction to these complex concepts with a fun use case. Come with little to lots of prior knowledge on these topics, and come away with new ideas to hack your own beloved family games with Spark for the best strategies, much to your family's dismay.
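To make the Monte Carlo simulation step concrete, here is a minimal, hypothetical sketch of how self-play data for a game like Oh Hell could be generated at scale with Spark. The game engine (simulateOneGame), the GameRecord schema, the partition count, and the output path are illustrative assumptions, not details from the talk.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record of one (state, action, reward) transition from a simulated game.
case class GameRecord(gameId: Long, stateKey: String, action: String, reward: Double)

object OhHellSimulation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("oh-hell-monte-carlo").getOrCreate()
    import spark.implicits._

    val numGames = 10000000L

    // Distribute game ids across the cluster; each task simulates complete games
    // with a random (or epsilon-greedy) policy and emits transitions that a later
    // learning job can aggregate into action-value estimates.
    val episodes = spark.range(0, numGames)
      .repartition(2000)
      .flatMap { id =>
        val gameId: Long = id                      // unbox the id produced by range()
        val rng = new scala.util.Random(gameId)    // seed per game for reproducibility
        simulateOneGame(gameId, rng)
      }

    episodes.write.mode("overwrite").parquet("/data/oh_hell/episodes")
  }

  // Placeholder for the actual game engine: plays one full game and returns its transitions.
  def simulateOneGame(gameId: Long, rng: scala.util.Random): Seq[GameRecord] = Seq.empty
}
```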

PAST SESSIONS

Lessons in Linear Algebra at Scale with Apache Spark: Let’s Make the Sparse Details a Bit More Dense - Summit 2019

If you enjoy Linear Algebra, Spark, and exceptionally bad puns, then this could be the talk for you! In this session, we will chronicle the adventures of developing a large-scale Spark system in Scala at Target to power a text-based similarity engine by using core Linear Algebra concepts. You will not hear about a shiny system and how awesome it is, but instead you will learn about everything that went wrong and all of the lessons that were learned along the way. We will cover concepts like Cosine Similarity, Spark's Distributed Matrix APIs, and the Breeze numerical processing library that powers these APIs under the hood, among other things. We will embark on this system development journey together to understand what it took from beginning to end to pull a performant and scalable similarity engine together. Linear Algebra is often the backbone of many prominent machine learning algorithms, and the goal is that from this session, you will gain a deeper understanding into what gotchas exist and what is needed to design, tune, and scale these types of systems.
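For a flavor of the building blocks involved, here is a minimal sketch (not the production system from the talk) that pairs Breeze for cosine similarity between two vectors with Spark's RowMatrix.columnSimilarities for all-pairs similarity across a distributed matrix. The toy data is purely illustrative.

```scala
import breeze.linalg.{DenseVector, norm}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.{SparkConf, SparkContext}

object CosineSketch {
  // Cosine similarity of two Breeze vectors: dot product over the product of norms.
  def cosine(a: DenseVector[Double], b: DenseVector[Double]): Double =
    (a dot b) / (norm(a) * norm(b))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("similarity-sketch"))

    // Toy distributed matrix: each RDD element is one row. columnSimilarities
    // returns cosine similarity for every pair of *columns*, so the items to be
    // compared (e.g. documents) should be laid out as columns.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 2.0),
      Vectors.dense(0.0, 3.0, 4.0),
      Vectors.dense(5.0, 6.0, 0.0)
    ))

    val similarities = new RowMatrix(rows).columnSimilarities()
    similarities.entries.collect().foreach(println)
  }
}
```

Note that columnSimilarities also accepts a threshold argument that switches to the approximate DIMSUM algorithm, which prunes small entries to keep the shuffle manageable at scale.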

Parallelizing with Apache Spark in Unexpected Ways - Summit 2019

Out of the box, Spark provides rich and extensive APIs for performing in-memory, large-scale computation across data. Once a system has been built and tuned with Spark Datasets/DataFrames/RDDs, have you ever been left wondering if you could push the limits of Spark even further? In this session, we will cover some of the tips learned while building retail-scale systems at Target to maximize the parallelization that you can achieve from Spark in ways that may not be obvious from current documentation. Specifically, we will cover multithreading the Spark driver with Scala Futures to enable parallel job submission.
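A minimal sketch of this driver-side pattern follows, assuming a handful of independent table-level jobs; the table names and thread-pool size are illustrative, not from the talk.

```scala
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import java.util.concurrent.Executors

object ParallelJobs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("parallel-submission").getOrCreate()

    // A dedicated thread pool for job submission; an explicit pool makes the
    // degree of driver-side parallelism obvious.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

    val tables = Seq("sales", "inventory", "guests", "stores") // illustrative names

    // Each Future triggers an independent action; Spark's scheduler can then
    // interleave the resulting jobs across the cluster instead of running them serially.
    val counts: Seq[Future[(String, Long)]] = tables.map { t =>
      Future(t -> spark.table(t).count())
    }

    val results = Await.result(Future.sequence(counts), Duration.Inf)
    results.foreach { case (table, n) => println(s"$table: $n rows") }

    spark.stop()
  }
}
```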

We will talk about developing custom partitioners to leverage the ability to apply operations across understood chunks of data and what tradeoffs that entails. We will also dive into strategies for parallelizing scripts with Spark that might have nothing to do with Spark, to support environments where peers work in multiple languages or perhaps a different language/library is just the best thing to get the job done. Come learn how to squeeze every last drop out of your Spark job with strategies for parallelization that go off the beaten path.
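For the custom partitioner idea, here is a minimal sketch of what such a partitioner might look like; the store-id keying and partition count are hypothetical examples, not the talk's actual code.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Routes records by a domain key (here a hypothetical store id) so that all
// data for one store lands in the same partition and can be processed together.
class StorePartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    // Non-negative modulo keeps the result in [0, numPartitions).
    case storeId: Int => ((storeId.hashCode % numPartitions) + numPartitions) % numPartitions
    case _            => 0
  }
}

object CustomPartitionerExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("custom-partitioner"))
    val salesByStore = sc.parallelize(Seq((1001, 25.0), (1002, 13.5), (1001, 7.25)))

    // partitionBy applies the custom partitioner; downstream per-partition
    // operations (e.g. mapPartitions) then see one store's data in one chunk.
    val partitioned = salesByStore.partitionBy(new StorePartitioner(8))
    partitioned.mapPartitions(iter => Iterator(iter.size)).collect().foreach(println)
  }
}
```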

Extending Apache Spark APIs Without Going Near Spark Source or a Compiler - Summit 2018

The more time you spend developing within a framework such as Apache Spark, the more you learn that there are additional features that would be helpful to have given the context and details of your specific use case. Spark supports a very concise and readable coding style using functional programming paradigms. Wouldn’t it be awesome to add your own functions into the mix using the same style? Well, you can! In this session, you will learn about using Scala’s “Enrich my library” programming pattern to add new functionality to Spark’s APIs. We will dive into a how-to guide with code snippets and present an example where this strategy was used to develop a validation framework for Spark Datasets in a production pipeline. Come learn how to enrich your Spark! Session hashtag: #DevSAIS19
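As a taste of the pattern, here is a minimal sketch of an implicit class that pins a new validation-style method onto Dataset without touching Spark's source; the method name and rule are illustrative, not the talk's actual framework.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.col

object DatasetEnrichment {
  // The "enrich my library" pattern: an implicit wrapper that adds a method
  // which reads as if it were part of the Dataset API itself.
  implicit class RichDataset[T](val ds: Dataset[T]) extends AnyVal {
    // Fails fast if the given column contains nulls; otherwise returns the
    // Dataset unchanged so the call chains like any built-in transformation.
    def requireNonNull(column: String): Dataset[T] = {
      val nullCount = ds.filter(col(column).isNull).count()
      require(nullCount == 0, s"Column '$column' has $nullCount null value(s)")
      ds
    }
  }
}
```

With DatasetEnrichment._ imported, a pipeline could then call something like spark.read.parquet("/data/orders").requireNonNull("order_id") and keep chaining transformations in the same fluent style.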