Dealing with problems that arise when running a long process over a large dataset can be one of the most time-consuming parts of development. For this reason, many data engineers and scientists save intermediate results and use them to quickly zero in on the sections that have issues, avoiding reruns of sections that work as intended. For data pipelines with several sections, managing the saving and loading of intermediate results can become almost as complicated as the core problem the developers are trying to solve. Changes may require previously saved intermediate results to be invalidated and overwritten. This process is typically manual, and it's very easy for a developer to mistakenly use outdated intermediate results. These problems can be even worse when multiple developers share intermediate results.

These issues can be addressed by introducing a logical signature for datasets. For each dataset, we'll compute a signature based on the identity of the input and on the logic applied. If the input and logic stay the same for some dataset between two executions, the signature will be consistent and we can safely load previously saved results. If either the input or the logic changes, then the signature will change and the dataset will be freshly computed. With these signatures, we can implement automatic checkpointing that works even among several concurrent users, as well as other useful features.
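As a rough illustration of the idea, here is a minimal sketch in plain Python. All names (`signature`, `compute_or_load`, the in-memory dict standing in for a shared checkpoint store) are hypothetical, not the speaker's actual API; the point is only that a signature derived from the input's identity and the logic's identity gates whether a saved result is reused or recomputed:

```python
import hashlib

def signature(input_sig: str, logic_id: str) -> str:
    """Combine the identity of the input with the identity of the
    transformation logic into one deterministic signature."""
    return hashlib.sha256(f"{input_sig}|{logic_id}".encode()).hexdigest()

# Stand-in for a shared checkpoint store (in practice this might be
# paths in a distributed filesystem, keyed by signature).
_checkpoints = {}

def compute_or_load(input_sig: str, logic_id: str, compute):
    """Load a checkpointed result if the signature matches a previous
    run; otherwise recompute and save under the new signature."""
    sig = signature(input_sig, logic_id)
    if sig in _checkpoints:
        # Same input and same logic: safe to reuse the saved result.
        return _checkpoints[sig]
    # Input or logic changed: the signature is new, so recompute.
    result = compute()
    _checkpoints[sig] = result
    return result
```

Because the key encodes both input and logic, a change to either one produces a fresh signature, so stale checkpoints are never matched; and since signatures are deterministic, concurrent users computing the same dataset converge on the same key.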
Nimbus Goehausen is a senior software engineer at Bloomberg, where he works on Spark infrastructure and applications. Prior to joining Bloomberg, he worked at Radius Intelligence, where he developed fuzzy business matching pipelines using Spark and Hadoop. Having experienced many of the pains involved in developing complex big data pipelines, he's looking for ways to improve the development experience with Spark. Nimbus has a bachelor's degree in physics and computer science from UC Berkeley, with a research focus in robotic control.