Applying the Lambda Architecture with Spark

The Lambda Architecture (LA) enables developers to build large-scale, distributed data processing systems in a flexible and extensible manner, remaining fault-tolerant against both hardware failures and human mistakes.

In the LA we deal with three layers, each coming with its own set of requirements:

  • the batch layer, which manages the master dataset (an immutable, append-only set of raw data) and pre-computes batch views;
  • the serving layer, which indexes the batch views so that they can be queried in a low-latency, ad-hoc way;
  • the speed layer, which deals with recent data only and compensates for the high latency of the batch layer.
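The interplay of the three layers can be sketched in a few lines of plain Python. This is an illustration of the concept only, not Spark code; the event shape and function names are invented for the example.

```python
from collections import Counter

# Master dataset: immutable, append-only raw data (here, page-view events).
master_dataset = [
    {"page": "/home"}, {"page": "/docs"}, {"page": "/home"},
]

# Batch layer: periodically recompute a batch view over the full master dataset.
def compute_batch_view(events):
    return Counter(e["page"] for e in events)

# Speed layer: maintain an incremental real-time view over recent events only,
# i.e. events that arrived after the last batch run.
realtime_view = Counter()

def on_new_event(event):
    realtime_view[event["page"]] += 1

# Serving layer: answer queries by merging the batch view with the real-time view.
def query(page, batch_view):
    return batch_view[page] + realtime_view[page]

batch_view = compute_batch_view(master_dataset)
on_new_event({"page": "/home"})   # arrives after the batch view was computed
print(query("/home", batch_view))  # 2 from the batch view + 1 from the speed layer = 3
```

The key point is the last line: a query never waits for the slow batch recomputation, because the speed layer covers exactly the window the batch view has not yet absorbed.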

Despite its increasing popularity, some practitioners find the LA challenging to apply. One reason is that the batch and real-time views typically have to be implemented in different environments: the batch views might be realized with Hive, while the real-time views are implemented as a Storm topology. In addition, the business logic is duplicated in two places and must be kept in sync.

With Spark we have a simple, elegant and increasingly popular solution: the Spark stack enables developers to implement an LA-compliant system in a unified development and test environment (pick one of Scala, Python or Java) that supports both batch and streaming operations at scale. In the talk we will show an end-to-end demo of an LA-compliant system implemented with Spark and discuss its features, including development, testing and maintenance as well as extensibility and operational aspects.
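The "write the business logic once" property can be sketched as follows. This is a hedged plain-Python stand-in, not actual Spark code: in Spark the same function would be passed to an RDD transformation in the batch job and to a DStream transformation in the streaming job, while here the two paths are simulated with lists. The `normalize` rule and the event shape are invented for the example.

```python
def normalize(event):
    """Business rule shared by the batch and speed layers: lowercase the page path."""
    return {"user": event["user"], "page": event["page"].lower()}

# Batch path: apply the rule across the full historical dataset
# (in Spark, e.g. rdd.map(normalize) in the batch job).
historical = [{"user": "a", "page": "/Home"}, {"user": "b", "page": "/Docs"}]
batch_out = [normalize(e) for e in historical]

# Streaming path: apply the *same* function to each micro-batch of recent events
# (in Spark Streaming, e.g. dstream.map(normalize)).
def handle_micro_batch(events):
    return [normalize(e) for e in events]

stream_out = handle_micro_batch([{"user": "a", "page": "/HOME"}])
print(batch_out[0]["page"], stream_out[0]["page"])  # /home /home
```

Because both layers call the one `normalize` function, there is no duplicated logic to keep in sync, which is exactly the maintenance problem of the Hive-plus-Storm setup described above.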

Additional Reading:

  • Improved Fault-tolerance and Zero Data Loss in Apache Spark Streaming

  • About Jim Scott

Jim has held positions running Operations, Engineering, Architecture and QA teams. He is the cofounder of the Chicago Hadoop Users Group (CHUG), where he has coordinated the Chicago Hadoop community for the past four years. Jim has worked in the Consumer Packaged Goods, Digital Advertising, Digital Mapping, Chemical and Pharmaceutical industries, and has built systems that handle more than 50 billion transactions per day. His work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop.