Spark User Concurrency and Context/RDD Sharing at Production Scale

As one of Spark’s early adopters, we will share lessons learned from our work at Zoomdata while making Spark a key architectural component of our application and bringing production-level comfort to our user community. This session will focus on the specific challenges we faced, the alternatives we evaluated, and the decisions we ultimately made, including:

– Building a decoupled Spark proxy that runs as a separate, YARN-enabled process, providing a shareable Spark context with shareable RDDs across load-balanced Zoomdata servers to enable high user concurrency and fault tolerance.
– Simulating various stress loads in Ganglia to measure and optimize for high user concurrency on a large number of datasets while writing to and reading from Spark in parallel.
– Properly tuning our Spark application to handle garbage collection and serialization/deserialization of RDDs, using columnar-storage RDDs rather than object RDDs, and eventually using partitioned RDDs.
– Managing the lifecycle of large numbers of RDDs.
– Overcoming binary incompatibilities when supporting Spark on top of different Hadoop distributions and versions.

About Justin Langseth

Justin Langseth is the Founder & CEO of Zoomdata. He previously founded Claraview, Clarabridge, and Augaroo. A graduate of MIT, Justin is an expert in big data, business intelligence, text analytics, sentiment analytics, and real-time data processing, and holds 14 technology patents.

About Farzad Aref

Farzad is the Head of Product at Zoomdata and one of its founding employees. He is responsible for Zoomdata’s roadmap, UX, and quality. He has over 12 years of experience building and delivering complex analytics solutions to Fortune 500 companies through his tenures at Clarabridge, IBM, Deloitte, and now Zoomdata.