Keyuri Shah has 13+ years of good experience in IT, with diversified companies. Her heart and mind expertise in designing, prototyping, building and deploying scalable data processing pipelines on distributed platform using Python, Spark, Databricks Delta Lake ecosystem. She works two passionate jobs every day, the second being a professional one as lead engineer at Northwestern Mutual and first being a mother of two beautiful boys. Just like her hero Sherlock Holmes, she likes to solve complex problems in this case big data systems involving volume, variety and velocity.
May 26, 2021 03:15 PM PT
The volume of available data is growing by the second (to an estimated 175 zetabytes by 2025), and it is becoming increasingly granular in its information. With that change every organization is moving towards building a data driven culture. We at Northwestern Mutual share similar story of driving towards making data driven decisions to improve both efficiency and effectiveness. Legacy system analysis revealed bottlenecks, excesses, duplications etc. Based on ever growing need to analyze more data our BI Team decided to make a move to more modern, scalable, cost effective data platform. As a financial company, data security is as important as ingestion of data. In addition to fast ingestion and compute we would need a solution that can support column level encryption, Role based access to different teams from our datalake.
In this talk we describe our journey to move 100’s of ELT jobs from current MSBI stack to Databricks and building a datalake (using Lakehouse). How we reduced our daily data load time from 7 hours to 2 hours with capability to ingest more data. Share our experience, challenges, learning, architecture and design patterns used while undertaking this huge migration effort. Different sets of tools/frameworks built by our engineers to help ease the learning curve that our non-Apache Spark engineers would have to go through during this migration. You will leave this session with more understand on what it would mean for you and your organization if you are thinking about migrating to Apache Spark/Databricks.
May 26, 2021 05:00 PM PT
We as data engineers are aware of trade off’s between development speed, metadata governance and schema evolution (or restriction) in rapidly evolving organization. Our day to day activities involve adding/removing/updating tables, protecting PII Information, curating and exposing data to our consumers. While our data lake keeps growing exponentially, there is equal increase in our downstream consumers. Struggle is to maintain balance between quickly promoting metadata changes with robust validation for downstream systems stability. In relational world DDL, DML changes can be managed through numerous options available for every kind of database from the vendor or 3rd party. As engineers we developed a tool which uses centralized git managed repository of data schemas in yml structure with ci/cd capabilities which maintains stability of our data lake and downstream systems.
In this presentation Northwestern Mutual Engineers, will discuss how they designed and developed new end-to-end ci/cd driven metadata management tool to make introduction of new tables/views, managing access requests etc in a more robust, maintainable and scalable way, all with only checking in yml files. This tool can be used by people who have no or minimal knowledge of spark.
Key focus will be: