Vicky is a Lead Data Engineer at Cox Automotive Data Solutions. She has over 5 years’ experience writing high-performance applications in MapReduce and Spark. She graduated from the University of Warwick with a Master of Mathematics degree in 2013 and, after a brief stint in Android development, has been solving data problems ever since. She now spends most of her days building and optimizing data pipelines, and is co-creator of Waimak, an open-source framework that makes it easier to create complex data flows in Apache Spark.
Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations that must be addressed when building data pipelines, such as performance, idempotency, reproducibility, and the small file problem. We'll work towards describing a common Data Engineering toolkit that separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.
We'll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we'll define new approaches and best practices for what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.