
This is a guest post from our friends at MapR.

This post summarizes my conversations over the last few months with users who have deployed Apache Spark in production on the MapR Distribution including Hadoop. My key observation is that Spark is making inroads into our user community. These users are taking advantage not only of Spark's rapid application development and performance, but also of the complete Spark stack that the MapR platform uniquely supports.

Why Spark?

We asked our users what they learned after deploying Spark, and here is what they had to share:

  1. Traditional MapReduce is hard to code and maintain. Users want to build many applications as quickly as possible, and Spark lets them cut development and maintenance time. This is in line with a survey we conducted recently, which found that 18% of MapR customers have deployed more than 50 use cases on a single cluster. Users mentioned that platform capabilities such as multi-tenancy, high availability, and data protection become even more critical when deploying so many applications so rapidly.
  2. Although Scala offers clear advantages for Spark application development, plenty of developers use the Java APIs to build Spark applications. Java 8, with its support for lambda expressions, is expected to make their lives considerably easier. The Python APIs are used mostly by a smaller subset of users, the data scientist community, mainly for initial data modeling.
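To illustrate why lambda expressions matter to Java developers, here is a minimal sketch that passes the same function to a transformation twice: once as an anonymous inner class and once as a Java 8 lambda. It uses only plain JDK types as a stand-in and does not call the Spark Java API; the names `LambdaSketch`, `LENGTH_VERBOSE`, `LENGTH_LAMBDA`, and `mapAll` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration; plain JDK types stand in for a Spark job.
class LambdaSketch {
    // Anonymous-inner-class style: five lines to say "take the length".
    static final Function<String, Integer> LENGTH_VERBOSE =
        new Function<String, Integer>() {
            @Override
            public Integer apply(String line) {
                return line.length();
            }
        };

    // Java 8 lambda style: the same behaviour in one expression.
    static final Function<String, Integer> LENGTH_LAMBDA = line -> line.length();

    // Applies the given function to every element, like a map() transformation.
    static List<Integer> mapAll(List<String> lines, Function<String, Integer> f) {
        return lines.stream().map(f).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("spark", "on", "hadoop");
        System.out.println(mapAll(lines, LENGTH_VERBOSE));
        System.out.println(mapAll(lines, LENGTH_LAMBDA));
    }
}
```

Both calls produce the same result; the difference is only in how much ceremony surrounds the one line of actual logic.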

 

Use Cases Overview

There are many different use cases that have been deployed combining Spark with MapR. Here are a few:

  1. Faster batch applications: Spark's in-memory speed is a definite advantage, especially for customer-facing applications. Many users have concluded that if a dataset fits comfortably in the combined memory of their nodes, and latency matters for the use case, they should convert the application to Spark quickly to gain the performance benefit. A leading sales performance management company has done exactly this for its production application, which was originally written using traditional MapReduce.
  2. ETL data pipelines: Given the full Spark stack support on MapR, a number of users are merging complex ETL pipelines into simpler programs that feed MLlib/Spark Streaming output to Spark SQL and GraphX applications. Novartis does this for drug discovery, using Spark for graph manipulations at scale. Several large financial services customers of MapR run ETL on streaming web clickstream data and load the results into transactional call center applications, so that customer service reps have the latest information about what customers have been researching online.
  3. OLAP Cubes: An emerging Spark use case across our customer base is the OLAP cube, where the end user can slice and dice data based on preconfigured datasets and filters. Predefined data loaded within a Spark context can be queried in real time by end users via predefined filters that kick off on-the-fly aggregations and simple linear regressions in the background. This solution is being used to deploy customer-facing services for real-time multidimensional OLAP analysis. As an example, Quantium, one of the largest analytics services providers in Australia, has implemented this solution for its end users.
  4. Operational Analytics: Yet another use case is real-time dashboarding and alerting based on streaming data, time-series data, or operational data such as web clickstreams, where a NoSQL store such as MapR-DB is deployed as a durable, high-throughput persistence layer. A large retail analytics firm, a prominent financial services firm, and a Fortune 100 healthcare company are implementing such solutions in production.
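The OLAP cube pattern described above (a predefined filter that triggers an aggregation and a simple linear regression) can be sketched in a few lines. This is only an illustration under stated assumptions: plain Java collections stand in for data held in a Spark context, and the `Sale` record, `revenueByWeek`, and `fitLine` are hypothetical names, not any customer's implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Hypothetical sketch; an in-memory list stands in for a cached dataset.
class CubeSketch {
    // A fact row with two dimensions (region, week) and one measure (revenue).
    static final class Sale {
        final String region;
        final int week;
        final double revenue;
        Sale(String region, int week, double revenue) {
            this.region = region;
            this.week = week;
            this.revenue = revenue;
        }
    }

    // Slice: keep rows matching the filter, then sum revenue per week.
    static Map<Integer, Double> revenueByWeek(List<Sale> sales, Predicate<Sale> filter) {
        Map<Integer, Double> totals = new TreeMap<>();
        for (Sale s : sales) {
            if (filter.test(s)) {
                totals.merge(s.week, s.revenue, Double::sum);
            }
        }
        return totals;
    }

    // Ordinary least squares fit of y = intercept + slope * x.
    // Returns {intercept, slope}; needs at least two distinct x values.
    static double[] fitLine(Map<Integer, Double> points) {
        int n = points.size();
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (Map.Entry<Integer, Double> e : points.entrySet()) {
            double x = e.getKey();
            double y = e.getValue();
            sx += x;
            sy += y;
            sxx += x * x;
            sxy += x * y;
        }
        double denom = n * sxx - sx * sx;
        if (denom == 0) {
            throw new IllegalArgumentException("need at least two distinct x values");
        }
        double slope = (n * sxy - sx * sy) / denom;
        double intercept = (sy - slope * sx) / n;
        return new double[] {intercept, slope};
    }

    public static void main(String[] args) {
        List<Sale> sales = Arrays.asList(
            new Sale("east", 1, 4.0), new Sale("east", 1, 6.0),
            new Sale("east", 2, 20.0), new Sale("east", 3, 30.0),
            new Sale("west", 1, 5.0), new Sale("west", 2, 5.0));
        Map<Integer, Double> east = revenueByWeek(sales, s -> s.region.equals("east"));
        double[] fit = fitLine(east);
        System.out.println(east + " intercept=" + fit[0] + " slope=" + fit[1]);
    }
}
```

In a real deployment the filter and aggregation would run as distributed operations over the cached dataset; the sketch only shows the shape of the computation that each end-user filter selection would trigger.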

 

Platform Capabilities Still Matter

It may come as no surprise, but the same enterprise-grade features that MapR customers have traditionally enjoyed remain applicable to Spark apps on Hadoop. NFS ingestion, high availability, a strong option for an in-Hadoop NoSQL database, disaster recovery, and cross-datacenter replication continue to matter and complete the story for production deployments.

 

Want to Learn More?

Read customer case studies for Spark on Hadoop.

If you are new to big data, check out our Spark-based Quick Start Solutions for Hadoop.