Josh Reilly is a Lead Software Engineer at Northwestern Mutual. His role is to provide architectural direction and to enable his teams to be successful through mentoring and the creation of libraries and frameworks that support daily development. Josh has led the design and development of full-stack web applications in React and an ecosystem of TypeScript microservices. He has been building out a suite of configuration-driven frameworks to support Spark ELT (Extract Load Transform) workloads on Databricks. In his free time, Josh likes to play guitar, roast his own coffee, and snowboard with his coworkers.
May 26, 2021 03:50 PM PT
At Northwestern Mutual, we are using Spark on Databricks to perform Extract Load Transform (ELT) workloads. We built a configuration-driven Python framework that lands data from various source systems and transforms it using Databricks Delta SQL. The framework bakes in consistency, performance, and access control while allowing our developers to leverage their existing SQL skill sets. With this framework, our developers spend less time creating and configuring Spark jobs and write minimal code.
The framework ingests a list of job items from a JSON configuration file, each with a command that generates a dataframe and a list of any number of destinations to write that dataframe to. Commands and destinations are specified by type in the configuration, accompanied by command-specific attributes and, where required, a supporting file such as a SQL file. These configurable commands and destinations also let us enforce best practices: securing PII data in our destinations, saving data to the correct locations, and connecting only to valid sources for the environment in which the job runs.
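The core dispatch loop described above can be sketched roughly as follows. This is an illustrative sketch only: the key names (`jobItems`, `command`, `destinations`, `type`) and the handler-registry shape are hypothetical, since the abstract does not publish the actual configuration schema.

```python
# Hypothetical sketch of a configuration-driven ELT dispatch loop.
# `config` is the parsed JSON configuration; `commands` and `destinations`
# map a configured "type" string to a handler function. All names here
# are illustrative, not the framework's real API.
def run_job(config, commands, destinations):
    """For each job item: build a dataframe, then write it to every destination."""
    for item in config["jobItems"]:
        cmd = item["command"]
        # Look up the command handler by its configured type (e.g. "sql")
        # and let it produce a dataframe from its command-specific attributes.
        df = commands[cmd["type"]](cmd)
        for dest in item["destinations"]:
            # Each destination handler can enforce best practices
            # (PII controls, correct save locations) before writing.
            destinations[dest["type"]](df, dest)
```

In this shape, adding a new source or sink means registering one handler function; job authors only touch configuration.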
Our key focus for this session will be:
May 26, 2021 05:00 PM PT
As data engineers, we are aware of the trade-offs between development speed, metadata governance, and schema evolution (or restriction) in a rapidly evolving organization. Our day-to-day activities involve adding, removing, and updating tables; protecting PII; and curating and exposing data to our consumers. While our data lake keeps growing exponentially, our downstream consumers are increasing just as fast. The struggle is to balance quickly promoting metadata changes against robust validation for downstream system stability. In the relational world, DDL and DML changes can be managed through numerous vendor and third-party options available for every kind of database. As engineers, we developed a tool that uses a centralized, Git-managed repository of data schemas in YAML, with CI/CD capabilities, to maintain the stability of our data lake and downstream systems.
In this presentation, Northwestern Mutual engineers will discuss how they designed and developed a new end-to-end CI/CD-driven metadata management tool that makes introducing new tables and views, managing access requests, and similar tasks more robust, maintainable, and scalable, all by simply checking in YAML files. The tool can be used by people with minimal or no knowledge of Spark.
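A table definition checked into such a schema repository might look like the following YAML fragment. The field names here are hypothetical, since the abstract does not specify the actual schema format; the point is that table structure, PII flags, and access grants all live in one reviewable file.

```yaml
# Hypothetical YAML table definition; all field names are illustrative.
table: customer_policies
database: curated
columns:
  - name: policy_id
    type: string
  - name: ssn
    type: string
    pii: true          # flagged so the pipeline can apply access controls
grants:
  - principal: analytics_readers
    privilege: SELECT
```

Because the file is plain YAML under Git, schema changes flow through pull requests and CI validation before they ever touch the data lake.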
Key focus will be: