Cheuk Lam is a Consultant Software Engineer with EMC Corporation. He has nine years of experience developing datacenter management software and is currently exploring methods to run datacenter management software at scale. He holds a PhD in Information Engineering from Chinese University of Hong Kong.
Datacenter health and fault analysis requires aggregating, thresholding or otherwise analyzing datacenter element telemetry streams. The key scaling challenge for this analysis is that individual telemetry streams cannot be analyzed in isolation in an "embarrassingly parallel" fashion. Instead, data from various streams must be combined and compared according to the (dynamic) topology of the datacenter. Thus, datacenter analytics are graph analytics and can be approached with a vertex-centric programming model as implemented in GraphX. However, the datacenter graph is extremely heterogeneous, comprising a wide variety of hosts, routers, switches, firewalls and the like. Thus, simple vertex-centric programs such as those used in GraphX to calculate PageRank or connected components of a graph cannot be directly applied to analyze datacenter telemetry. In this talk, we describe our solution: A model-driven approach implemented using Spark Streaming and GraphX. In our approach, the datacenter expert uses a simple Domain Specific Language (DSL) to describe the expressions that must be continuously evaluated at each type of vertex in the datacenter graph. These expressions can depend directly on datacenter telemetry or (recursively) on other expressions defined by the expert. Crucially, expressions defined on one vertex type can depend on the expressions or telemetry defined on other edge-connected vertex types. As an extremely simple example: the designer might define the TotalReadBandwith of each StorageSystem vertex as the sum of the ReadBandwidth values on each of the StorageDevices connected to it over a "Contains" edge. The system automatically transforms this expert model into a vertex-program that can be run directly by GraphX fed by Spark Streaming. Next, we describe in detail our performance results. Finally, we demonstrate the deployment settings and implementation "tricks" which we used to enable both good performance and reasonable recovery times from node failure using Spark's RDD recalculation.