When beginning to use Spark, we have the choice between two roads: We can sit down and leverage the convenience of high-level APIs to implement our use cases directly, usually through trial, error, and StackOverflow. In doing so, we rely on Spark to magically execute our workload, hopefully in the most efficient way. Most developers stop here. Or, we can take a different approach and first gain an understanding of Spark's concepts and what is happening internally. From my experience, most people, and also most companies, tend to take the former approach and start implementing their use cases right away. This approach, while certainly valid, often leaves us out in the rain when performance issues arise as we try to scale our projects.
Throughout this talk, we will walk through the most important Spark (Core) internal components to gain a deeper understanding of how parallelization is achieved. Based on these insights, we will illuminate some of the most common performance pitfalls and analyze where they originate. Whether you are an experienced Spark user or want to leverage all of its beauty right from the start, this talk gives practical advice on how to write better Spark code.
Session hashtag: #SAISDev16
Philipp is a freelance data science and big data consultant, helping his clients bring data-driven use cases to life. He is passionate about helping companies create innovative applications, boost existing ones, and educate teams on how to write scalable applications. When working with in-house development teams, he aims to leave behind a different way of thinking about their challenges and how to solve them. He has spoken at various events to help people gain a better understanding of how Spark is designed and its most fundamental concepts.