At Northwestern Mutual, we use Spark on Databricks to run Extract, Load, Transform (ELT) workloads. We built a configuration-driven Python framework that lands data from various source systems and transforms it using Databricks Delta SQL. The framework bakes in consistency, performance, and access control while allowing our developers to leverage their existing SQL skill sets. With this framework, our developers write minimal code and spend less time creating and configuring Spark jobs.
The framework reads a list of job items from a JSON configuration file. Each job item has a command that generates a DataFrame and a list of any number of destinations to write that DataFrame to. Commands and destinations are specified by type in the configuration, accompanied by command-specific attributes and, if required, an additional file such as a SQL file. These configurable commands and destinations also let us enforce certain best practices: securing PII data in our destinations, saving data in the correct locations, and connecting to the valid sources for the environment the job runs in.
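To make the configuration-driven idea concrete, here is a minimal sketch of how job items, typed commands, and typed destinations could be dispatched. It is not the framework's actual code. The JSON field names (`jobs`, `command`, `destinations`, `pii_columns`), the type names (`inline_rows`, `memory`), and all function names are assumptions made up for illustration. Plain lists of dictionaries stand in for Spark DataFrames so the example runs without a cluster.

```python
import json

# Hypothetical configuration: every field name here is an assumption,
# not the real schema used by the framework described above.
CONFIG = """
{
  "jobs": [
    {
      "name": "customers",
      "command": {
        "type": "inline_rows",
        "rows": [
          {"id": 1, "ssn": "123-45-6789"},
          {"id": 2, "ssn": "987-65-4321"}
        ]
      },
      "destinations": [
        {"type": "memory", "target": "raw_customers", "pii_columns": ["ssn"]}
      ]
    }
  ]
}
"""

COMMANDS = {}      # command type -> function(attrs) returning rows
DESTINATIONS = {}  # destination type -> function(rows, attrs)
SINK = {}          # in-memory stand-in for real storage locations


def command(type_name):
    """Register a function as the handler for a command type."""
    def register(fn):
        COMMANDS[type_name] = fn
        return fn
    return register


def destination(type_name):
    """Register a function as the handler for a destination type."""
    def register(fn):
        DESTINATIONS[type_name] = fn
        return fn
    return register


@command("inline_rows")
def inline_rows(attrs):
    # Stand-in for a command that would normally produce a Spark DataFrame,
    # for example by running a SQL file against a source system.
    return [dict(row) for row in attrs["rows"]]


@destination("memory")
def write_memory(rows, attrs):
    # Illustrative best-practice hook: mask configured PII columns on write.
    pii = set(attrs.get("pii_columns", []))
    SINK[attrs["target"]] = [
        {key: ("***" if key in pii else value) for key, value in row.items()}
        for row in rows
    ]


def run(config_text):
    """Run every job item: one command, then each of its destinations."""
    config = json.loads(config_text)
    for job in config["jobs"]:
        cmd = job["command"]
        if cmd["type"] not in COMMANDS:
            raise ValueError(f"unknown command type: {cmd['type']}")
        rows = COMMANDS[cmd["type"]](cmd)
        for dest in job["destinations"]:
            if dest["type"] not in DESTINATIONS:
                raise ValueError(f"unknown destination type: {dest['type']}")
            DESTINATIONS[dest["type"]](rows, dest)


run(CONFIG)
```

Because handlers are looked up by type, a rule such as PII masking lives in one destination handler and applies to every job that uses that destination, rather than being reimplemented per job.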
Our key focus for this session will be:
Fred Kimball is a Software Engineer at Northwestern Mutual. His responsibilities include building, maintaining, and securing data infrastructure, creating automated build and deployment pipelines, and...
Josh Reilly is a Lead Software Engineer at Northwestern Mutual. His role is to provide architectural direction as well as enable his teams to be successful through mentoring and the creation of librar...