Cindy Mottershead is an experienced, hands-on software architect involved in the architecture, design, and implementation of big data, real-time, machine learning, and distributed systems. She is an effective change agent driving strategic decisions and implementations across the company. She has led the architecture of scalable, performant, distributed solutions into production, pivoting when necessary to maintain a competitive edge. Her last startup, attentive.ly, was acquired by Blackbaud, where she is currently involved in leading the AI Architecture.
June 24, 2020 05:00 PM PT
We present our solution for building an AI Architecture that gives engineering teams the ability to leverage data to drive insight and help our customers solve their problems. We started with siloed data: entities that were described differently by each product, different formats, complicated security and access schemes, and data spread over numerous locations and systems. We discuss how we created a Delta Lake to bring this data together, instrumented data ingestion from the various external data sources, and set up legal and security protocols for accessing the Delta Lake. We then go into detail about our method for conforming all the data to a Common Data Model using a metadata-driven pipeline.
This metadata-driven pipeline, or Configuration Driven Pipeline (CDP), uses Spark Structured Streaming to take change events from the ingested data, consults a Data Catalog Service for the mappings and transformations required, and pushes the conformed data into the Common Data Model (CDM). The pipeline makes extensive use of the Spark API to perform the numerous types of transformations required to take these change events as they come in and UPSERT them into a Delta CDM. This model can take any set of relational databases (thousands in our case) and transform them into a CDM in a big data format (Delta Lake/Parquet) in a scalable, performant way, all from metadata. It can then perform schema-on-read to project from this CDM into any requested destination (database, filesystem, stream, etc.). This gives Data Scientists the ability to request data by specifying metadata: the pipeline runs automatically, produces the schema they require with all data types conformed to standard values, and deposits the result at their specified destination.
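The flow described above (catalog lookup, conform, UPSERT, project) can be sketched in a few lines. This is a minimal in-memory illustration in plain Python, not the Spark Structured Streaming and Delta Lake implementation from the talk. The catalog entry, table and column names, and transform names are all hypothetical, and a dictionary keyed by primary key stands in for the Delta CDM table.

```python
from datetime import datetime

# Hypothetical catalog entry: how one source table maps onto a CDM entity.
# In the talk this would come from the Data Catalog Service.
MAPPING = {
    "source": "crm_a.donors",
    "target": "cdm.constituent",
    "key": "constituent_id",
    "columns": {
        "DonorID": {"to": "constituent_id", "transform": "to_str"},
        "Email": {"to": "email", "transform": "lower_trim"},
        "Joined": {"to": "joined_on", "transform": "us_date_to_iso"},
    },
}

# Named transforms that the metadata refers to (illustrative set only).
TRANSFORMS = {
    "to_str": lambda v: str(v),
    "lower_trim": lambda v: v.strip().lower(),
    "us_date_to_iso": lambda v: datetime.strptime(v, "%m/%d/%Y").date().isoformat(),
}


def conform(event, mapping):
    """Rename and convert one change event according to the mapping metadata."""
    row = {}
    for src_col, rule in mapping["columns"].items():
        value = event.get(src_col)
        convert = TRANSFORMS[rule["transform"]]
        row[rule["to"]] = None if value is None else convert(value)
    return row


def upsert(table, row, key):
    """Insert the row, or update the existing row that has the same key."""
    table.setdefault(row[key], {}).update(row)


def project(table, fields):
    """Schema-on-read: return only the fields a consumer asked for."""
    return [{f: r.get(f) for f in fields} for r in table.values()]


cdm = {}  # stand-in for the Delta CDM table
events = [
    {"DonorID": 42, "Email": " Ada@Example.org ", "Joined": "06/24/2020"},
    {"DonorID": 42, "Email": "ada@new.example.org", "Joined": "06/24/2020"},
    {"DonorID": 7, "Email": None, "Joined": "01/02/2019"},
]
for event in events:
    upsert(cdm, conform(event, MAPPING), MAPPING["key"])

view = project(cdm, ["constituent_id", "email"])
```

The point of the design is that supporting another source table means adding another catalog entry rather than writing new pipeline code. In a Spark deployment, the upsert step would typically be a Delta Lake MERGE applied to each micro-batch, which this sketch does not attempt to reproduce.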