This talk presents a continuous application example that relies on Spark FAIR scheduler as the conductor to orchestrate the entire “lambda architecture” in a single spark context. As a typical time series event stream analysis might involved, there are four key components:- an ETL step to store the raw data
– a series of real time aggregation on the joint of streaming input and historical data to power a model
– model execution
– ad-hoc query for human inspection.
The key benefits of this setup compared to a typical design that has a bunch of Spark application running individually are
1. Decouple streaming batches process from triggering model calculation, model calculations are triggered at a different pace from the stream processing.
2. Model is always processing the latest data, using pure rdd APIs.
3. Launch various operations in different threads on the driver node, ensuring them got submitted to the appropriate fair scheduler pool. Let FAIR scheduler to do the resource distribution.
4. Share code and time by sharing the actual data transformation (like the rdds in the intermediate steps).
5. Support adhoc queries on intermediate state without a dedicated serving layer or output protocol.
6. Only one app to monitor and tune.
Session hashtag: #SFdev17