Best Practices for Building and Deploying Data Pipelines in Apache Spark

Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations that need to be addressed when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.
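
To make two of these considerations concrete: the small file problem arises when each Spark task writes its own tiny output file, degrading downstream reads, while idempotency means re-running a pipeline produces the same result rather than duplicated data. A minimal sketch in plain Spark follows; the paths, partition column, and file count are illustrative, not from the talk:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("pipeline-sketch").getOrCreate()

// Only replace the partitions this run actually writes (Spark >= 2.3),
// so re-running the same job is idempotent rather than duplicating data.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

spark.read.parquet("/data/raw/events")        // illustrative input path
  // Tackle the small file problem: collapse the many tiny per-task
  // outputs into a bounded number of files before writing.
  .repartition(8)
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("event_date")                  // illustrative partition column
  .parquet("/data/curated/events")            // illustrative output path
```

A toolkit of the kind the talk describes bakes decisions like these in once, so pipeline authors only have to supply the business logic.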

We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices for what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.
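
For a flavour of the library, a Waimak flow is declared as a sequence of actions over labelled datasets and then handed to an executor. The sketch below is adapted from the patterns in the project’s README; treat the import path and method names (`sparkFlow`, `openCSV`, `alias`, `writeParquet`, `sparkExecutor`) as indicative of the API rather than exact, and the paths and labels as placeholders:

```scala
import com.coxautodata.waimak.dataflow.Waimak

// Assumes an active SparkSession named `spark` is in scope.
// Each action reads and/or produces labelled datasets, keeping business
// logic separate from execution and production concerns.
val flow = Waimak.sparkFlow(spark)
  .openCSV("/data/landing")("sales", "customers") // register two CSV inputs
  .alias("sales", "cleaned_sales")                // relabel for downstream steps
  .writeParquet("/data/curated")("cleaned_sales", "customers")

// Execute: the executor resolves dependencies between actions and runs them.
Waimak.sparkExecutor().execute(flow)
```

The point of the abstraction is that production concerns such as file sizing and idempotent writes live inside the actions, not in every individual pipeline.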

About Vicky Avison

Cox Automotive UK

Vicky is a Lead Data Engineer at Cox Automotive Data Solutions. She has over 5 years' experience writing high-performance applications in MapReduce and Spark. She graduated from the University of Warwick with a Master of Mathematics degree in 2013 and, after a brief stint in Android development, has been solving data problems ever since. She now spends most of her days building and optimizing data pipelines, and is co-creator of Waimak, an open-source framework that makes it easier to create complex data flows in Apache Spark.

About Alex Bush

KPMG Lighthouse

Alex Bush is a Data Engineer at KPMG Lighthouse New Zealand. He was previously a Lead Data Engineer at Cox Automotive, where he co-created Waimak, an open-source framework that makes it easier to create complex data flows in Apache Spark. Having graduated from the University of Edinburgh with a Master’s in Computational Physics, he took to the world of Big Data six years ago and has been at home there ever since. He has previously worked for Centrica/British Gas and Hortonworks.