Mark is a software engineer working on Apache Spark at Cloudera. He is a co-author of the book Hadoop Application Architectures and also wrote a section of the book Programming Hive. Mark is a committer on Apache Bigtop and a committer and PMC member on Apache Sentry. He has contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop and Apache Flume. Mark is a sought-after speaker on Big Data topics at national and international conferences. He occasionally blogs about technology on his blog.
In the world of distributed computing, Spark has simplified development and opened the doors for many to start writing distributed programs. Folks with little to no distributed coding experience can now write just a couple of lines of code that will immediately get hundreds or thousands of machines working on creating business value. However, even though Spark code is easy to write and read, that doesn't mean users don't run into long-running, slow-performing jobs or out-of-memory errors. Thankfully, most of these issues have nothing to do with Spark itself, but with the approach we take when using it. This session will go over the top five things that we've seen in the field that prevent people from getting the most out of their Spark clusters. When some of these issues are addressed, it is not uncommon to see the same job run 10x or even 100x faster on the same cluster, with the same data, just with a different approach.
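One recurring example of "same data, different approach" is pre-aggregating within each partition before the shuffle (Spark's reduceByKey) instead of shipping every record across the network to one place (groupByKey). The sketch below is framework-free plain Python with made-up data, mimicking the two shuffle strategies to show why the combined version moves far fewer records; the function and variable names are illustrative, not Spark APIs.

```python
from collections import defaultdict

# Hypothetical (key, value) records spread across two partitions.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("a", 5)],
]

def sum_by_key_naive(partitions):
    """groupByKey-style: ship every single record to the reducer, then sum."""
    shuffled = defaultdict(list)
    for part in partitions:
        for key, value in part:            # every record crosses the network
            shuffled[key].append(value)
    return {k: sum(v) for k, v in shuffled.items()}

def sum_by_key_combined(partitions):
    """reduceByKey-style: pre-aggregate locally, then ship one record per key."""
    shuffled = defaultdict(int)
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value            # map-side combine, no network cost
        for key, total in local.items():   # at most one record per key shipped
            shuffled[key] += total
    return dict(shuffled)
```

Both functions return the same totals; the difference is that the combined version sends one partial sum per key per partition instead of every raw record, which is exactly the kind of approach change that can make the same job run an order of magnitude faster.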
So you know you want to write a streaming app, but any non-trivial streaming app developer has to think about these questions:
- How do I manage offsets?
- How do I manage state?
- How do I make my Spark Streaming job resilient to failures? Can I avoid some failures altogether?
- How do I gracefully shut down my streaming job?
- How do I monitor and manage my streaming job (e.g. retry logic)?
- How can I better manage the DAG in my streaming job?
- When do I use checkpointing, and for what? When should I not use checkpointing?
- Do I need a WAL when using a streaming data source? Why? When don't I need one?
This session will share the practices that no one talks about when you start writing your streaming app, but that you'll inevitably need to learn along the way. Session hashtag: #SFdev5
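The offset-management question above usually comes down to one pattern: process a batch, then atomically commit the offsets, so a crashed-and-restarted job resumes from the last commit (at-least-once delivery). Here is a framework-agnostic sketch of that pattern in plain Python; the file-based store, function names, and (partition, offset, payload) record shape are illustrative assumptions, not Spark or Kafka APIs.

```python
import json
import os

def load_offsets(path):
    """Return the last committed offsets, or an empty dict on first run."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def commit_offsets(path, offsets):
    """Persist offsets atomically: write a temp file, then rename over the old one."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(offsets, f)
    os.replace(tmp, path)

def process_batch(path, batch, sink):
    """Process a batch of (partition, offset, payload) records, then commit.

    Records at or below the committed offset are skipped, so after a crash
    the job re-reads the batch but does not re-emit finished records.
    Offsets are committed only after the whole batch succeeds: at-least-once.
    """
    committed = load_offsets(path)
    for partition, offset, payload in batch:
        if offset <= committed.get(str(partition), -1):
            continue                      # already processed before the crash
        sink.append(payload)              # stands in for the real side effect
        committed[str(partition)] = offset
    commit_offsets(path, committed)       # commit only once the batch is done
```

Committing after the side effect (rather than before) is the design choice that trades duplicate work on failure for never losing a record; reversing the order gives at-most-once semantics instead.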