We as data engineers are aware of trade off’s between development speed, metadata governance and schema evolution (or restriction) in rapidly evolving organization. Our day to day activities involve adding/removing/updating tables, protecting PII Information, curating and exposing data to our consumers. While our data lake keeps growing exponentially, there is equal increase in our downstream consumers. Struggle is to maintain balance between quickly promoting metadata changes with robust validation for downstream systems stability. In relational world DDL, DML changes can be managed through numerous options available for every kind of database from the vendor or 3rd party. As engineers we developed a tool which uses centralized git managed repository of data schemas in yml structure with ci/cd capabilities which maintains stability of our data lake and downstream systems.
In this presentation Northwestern Mutual Engineers, will discuss how they designed and developed new end-to-end ci/cd driven metadata management tool to make introduction of new tables/views, managing access requests etc in a more robust, maintainable and scalable way, all with only checking in yml files. This tool can be used by people who have no or minimal knowledge of spark.
Key focus will be:
Josh Reilly is a Lead Software Engineer at Northwestern Mutual. His role is to provide architectural direction as well as enable his teams to be successful through mentoring and the creation of librar...
Keyuri Shah has 13+ years of good experience in IT, with diversified companies. Her heart and mind expertise in designing, prototyping, building and deploying scalable data processing pipelines on dis...