Jacob enjoys solving weird problems, especially when they involve graphs, infrastructure, or data modeling. Preferably some combination of the three. When not writing code, he’s working on a sculpture or practicing aerial acrobatics.
Spark lets developers easily create highly complex and powerful data flows and transformations. However, with the mechanisms that exist today, logical plans quickly become difficult to grok, and executions become difficult to debug and tune. Debugging a Spark job is a multistep process that requires combining several sources of information: YARN application logs, the Spark UI, individual execution logs, and the query itself must all be examined to make sense of a workflow. Even then, it's often unclear how the code a developer writes gets compiled into a set of stages and RDDs. This is significantly harder when SparkSQL is used (and it should be, since it's an excellent abstraction!) because it introduces an additional logical layer.

We introduce Morticia, a tool for visualizing, debugging, and performing post-mortem analysis of complex Spark workflows. It provides a graphical depiction of the Spark execution DAG at a logical level, annotated with information as the workflow executes. Historical workflows are archived, which is what enables post-mortem analysis.

A Morticia graph is an interactive visualization that includes Spark stages, RDDs, and their associated logical operators. Each stage displays important execution information such as start and end times, status, and number of tasks, along with run-time metrics such as the number of input and output records, input and output sizes, and execution memory. RDD nodes and their associated logical operators are nested inside the stages and link together to form the whole graph. Each node displays useful diagnostic information such as operation scope, schema, total partitions, and input and output records.

Morticia has enabled data scientists at Stitch Fix to develop, tune, and maintain highly complex workflows while minimizing the effort and support required from core infrastructure engineers. We will also discuss some future enhancements to Morticia and share plans for open sourcing the codebase.
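
To make the graph structure described above more concrete, here is a minimal Python sketch of one way such a graph could be modeled: stages that carry execution information, RDD nodes nested inside them, and links between RDDs from which stage-level dependencies can be derived. This is not Morticia's code. Every class, field, and method name here (`RddNode`, `StageNode`, `WorkflowGraph`, `stage_edges`) is a hypothetical chosen for illustration, and the sample values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class RddNode:
    """One RDD and its associated logical operator (hypothetical fields)."""
    rdd_id: int
    operator: str          # logical operator name, e.g. "Filter"
    scope: str             # operation scope
    schema: List[str]      # column names
    partitions: int        # total partitions
    input_records: int = 0
    output_records: int = 0


@dataclass
class StageNode:
    """One stage, holding execution info and the RDD nodes nested inside it."""
    stage_id: int
    status: str
    start_ms: int
    end_ms: Optional[int]  # None while the stage is still running
    num_tasks: int
    rdds: List[RddNode] = field(default_factory=list)

    def duration_ms(self) -> Optional[int]:
        if self.end_ms is None:
            return None
        return self.end_ms - self.start_ms


@dataclass
class WorkflowGraph:
    """Stages keyed by id, plus directed links between RDD ids."""
    stages: Dict[int, StageNode] = field(default_factory=dict)
    rdd_edges: List[Tuple[int, int]] = field(default_factory=list)

    def add_stage(self, stage: StageNode) -> None:
        self.stages[stage.stage_id] = stage

    def stage_of(self, rdd_id: int) -> int:
        """Return the id of the stage that contains the given RDD."""
        for stage in self.stages.values():
            if any(rdd.rdd_id == rdd_id for rdd in stage.rdds):
                return stage.stage_id
        raise KeyError(rdd_id)

    def stage_edges(self) -> Set[Tuple[int, int]]:
        """Derive stage dependencies from RDD links that cross stage boundaries."""
        edges = set()
        for src, dst in self.rdd_edges:
            src_stage, dst_stage = self.stage_of(src), self.stage_of(dst)
            if src_stage != dst_stage:
                edges.add((src_stage, dst_stage))
        return edges


# Illustrative data only: a scan and filter in stage 0 feeding an aggregate in stage 1.
graph = WorkflowGraph()
graph.add_stage(StageNode(0, "COMPLETE", 1000, 2500, 8, [
    RddNode(0, "Scan", "scan", ["id", "price"], 8, 0, 100),
    RddNode(1, "Filter", "filter", ["id", "price"], 8, 100, 40),
]))
graph.add_stage(StageNode(1, "RUNNING", 2500, None, 4, [
    RddNode(2, "HashAggregate", "aggregate", ["id", "total"], 4, 40, 0),
]))
graph.rdd_edges = [(0, 1), (1, 2)]
```

In this sketch, only the link from RDD 1 to RDD 2 crosses a stage boundary, so `graph.stage_edges()` yields a single dependency from stage 0 to stage 1. Nesting RDD nodes inside stages, rather than keeping two separate graphs, mirrors the abstract's description of operators being displayed within the stage that executes them.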