Fred Kimball is a Software Engineer at Northwestern Mutual. His responsibilities include building, maintaining, and securing data infrastructure, creating automated build and deployment pipelines, and developing several frameworks supporting the Spark ecosystem at NM. Fred strives to simplify development, optimize workloads to perform at their best, and prepare tools for the future. He enjoys diving deep into new technologies to learn how they work, along with all their quirks and tricks. When his need for speed in his Spark jobs is met, Fred enjoys his passions for gaming and cars.
May 26, 2021 03:50 PM PT
At Northwestern Mutual, we are using Spark on Databricks to perform Extract Load Transform (ELT) workloads. We built a configuration-driven Python framework that lands data from various source systems and transforms it using Databricks Delta SQL. The framework bakes in consistency, performance, and access control while allowing our developers to leverage their existing SQL skillsets. With this framework, our developers spend less time creating and configuring Spark jobs, and the code required is minimal.
The framework ingests a list of job items from a JSON configuration file, each with a command that generates a dataframe and a list of any number of destinations to write the dataframe to. These commands and destinations are specified by type in the configuration, accompanied by command-specific attributes and an additional file where required, such as a SQL file. These configurable commands and destinations also let us enforce certain best practices, such as securing PII data in our destinations, saving data to the correct locations, and connecting only to valid sources for the environment the job is run in.
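To make the shape of this concrete, here is a minimal sketch of a configuration-driven dispatch loop. The JSON schema, job names, command types, and handler functions below are all hypothetical illustrations of the pattern described above, not the actual Northwestern Mutual framework; stub handlers stand in for Spark reads and Delta writes so the flow can be shown without a cluster.

```python
import json

# Hypothetical config: a list of job items, each with a typed command
# that produces a dataframe and any number of typed destinations.
CONFIG = """
{
  "jobs": [
    {
      "name": "load_policies",
      "command": {"type": "sql", "file": "policies.sql"},
      "destinations": [
        {"type": "delta", "path": "/mnt/curated/policies", "pii": true},
        {"type": "delta", "path": "/mnt/reporting/policies", "pii": false}
      ]
    }
  ]
}
"""

def run_job(job, command_handlers, destination_handlers):
    """Dispatch one job item: build a dataframe from its command,
    then write it to every configured destination."""
    command = job["command"]
    df = command_handlers[command["type"]](command)
    for dest in job["destinations"]:
        destination_handlers[dest["type"]](df, dest)

# Stubs in place of spark.sql(...) and df.write.format("delta")...
written = []
command_handlers = {"sql": lambda cmd: f"dataframe from {cmd['file']}"}
destination_handlers = {"delta": lambda df, dest: written.append((dest["path"], df))}

config = json.loads(CONFIG)
for job in config["jobs"]:
    run_job(job, command_handlers, destination_handlers)

print(written)
```

Because commands and destinations are looked up by type, adding a new source or sink is a matter of registering a handler, and cross-cutting checks (PII handling, path validation, environment-specific connections) can be applied centrally in those handlers.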
Our key focus for this session will be: