Justin Langseth is the Founder & CEO of Zoomdata. He previously founded Strategy.com, Claraview, Clarabridge, and Augaroo. A graduate of MIT, Justin is an expert in big data, business intelligence, text analytics, sentiment analytics, and real-time data processing and holds 14 technology patents.
“Spark allows for extremely fast analytics and joins across huge amounts of data, and the SparkSQL and SchemaRDD extensions in Spark 1.0 provide new, easier interoperability with existing Hadoop-based data resources and schematized data. We will share our work at Zoomdata implementing real-time and historical BI-style slice-and-dice analytics and dashboarding directly on top of Spark (without Shark, due to performance issues we will discuss). We will highlight our early lessons learned related to data scalability, loading, context sharing, real-time RDD appending/coalescing, and concurrent query handling. We will also discuss the new SparkSQL and SchemaRDD features in Spark 1.0 that allow direct access to Parquet and other schematized data, and cover partitioning strategies that enable in-application partition elimination to speed up large analytical queries.”
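The in-application partition elimination mentioned above can be sketched independently of Spark: keep min/max statistics on the partition key alongside each partition, and consult them before scanning so that partitions which cannot match a query predicate are never read. A minimal Python sketch of this idea, where the `Partition` class, paths, and date ranges are hypothetical illustrations rather than Zoomdata's or Spark's actual API:

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class Partition:
    """One partition of a large table, with min/max stats on the partition key."""
    path: str       # e.g. a Parquet directory (hypothetical layout)
    min_date: date
    max_date: date

def eliminate_partitions(parts: List[Partition], lo: date, hi: date) -> List[Partition]:
    """Keep only partitions whose [min_date, max_date] range overlaps [lo, hi].
    Excluded partitions are never scanned, which speeds large analytical queries."""
    return [p for p in parts if p.max_date >= lo and p.min_date <= hi]

partitions = [
    Partition("/data/events/2014-01", date(2014, 1, 1), date(2014, 1, 31)),
    Partition("/data/events/2014-02", date(2014, 2, 1), date(2014, 2, 28)),
    Partition("/data/events/2014-03", date(2014, 3, 1), date(2014, 3, 31)),
]

# A query filtered to February touches only one of the three partitions.
hit = eliminate_partitions(partitions, date(2014, 2, 10), date(2014, 2, 20))
print([p.path for p in hit])  # -> ['/data/events/2014-02']
```

The same pruning logic applies whether partitions are Parquet directories, cached RDDs, or any other schematized storage with per-partition statistics.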
As one of Spark’s early adopters, we will share lessons learned from our work at Zoomdata while making Spark a key component of our application architecture and bringing production-level confidence to our user community. This session will focus on the specific challenges we faced, the alternatives we evaluated, and the decisions we ultimately made, including:
– Building a decoupled Spark proxy that runs as a separate (YARN-enabled) process, allowing a sharable Spark context with sharable RDDs across load-balanced Zoomdata servers, which enables high user concurrency and fault tolerance.
– Simulating various stress loads, monitored in Ganglia, to measure and optimize for high user concurrency across large numbers of datasets while writing to and reading from Spark in parallel.
– Properly tuning our Spark application for garbage collection and RDD serialization/deserialization, using columnar-storage RDDs rather than object RDDs, and eventually adopting partitioned RDDs.
– Lifecycle management when dealing with large numbers of RDDs.
– Overcoming binary incompatibilities when supporting Spark on top of different Hadoop distributions and versions.
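One way to frame the RDD lifecycle problem in the bullets above: when many concurrent users pin datasets in a shared Spark context, something must decide which cached RDDs to unpersist. A minimal sketch of an LRU-style eviction policy in pure Python, where the capacity limit and the `RddCache` class are illustrative assumptions, not Zoomdata's implementation, and the eviction step stands in for Spark's `RDD.unpersist()`:

```python
from collections import OrderedDict

class RddCache:
    """Tracks cached datasets and evicts the least-recently-used one when
    capacity is exceeded (a stand-in for managing many cached Spark RDDs)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._cached = OrderedDict()  # name -> dataset handle, in LRU order
        self.evicted = []             # names we "unpersisted"

    def get(self, name):
        """Mark a dataset as recently used and return it (or None if absent)."""
        if name not in self._cached:
            return None
        self._cached.move_to_end(name)
        return self._cached[name]

    def put(self, name, dataset):
        """Cache a dataset, evicting the LRU entry if over capacity."""
        self._cached[name] = dataset
        self._cached.move_to_end(name)
        if len(self._cached) > self.capacity:
            victim, _ = self._cached.popitem(last=False)
            self.evicted.append(victim)  # real code would call rdd.unpersist()

cache = RddCache(capacity=2)
cache.put("sales_2013", object())
cache.put("sales_2014", object())
cache.get("sales_2013")            # touch: "sales_2014" is now least recently used
cache.put("clickstream", object())
print(cache.evicted)               # -> ['sales_2014']
```

In a shared-context proxy, a policy like this would also have to coordinate with in-flight queries before unpersisting, which is part of what makes lifecycle management at scale non-trivial.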