“Shiran Algai is a Senior Manager of Software Development in Blackbaud’s Data Intelligence Center Of Excellence. Blackbaud is the world’s leading cloud software company powering social good. Shiran started his career at Blackbaud in 2006 after graduating from Clemson University with a BS in Computer Engineering. After 10 years as a software engineer working on and leading several initiatives across Blackbaud’s wide portfolio of products, Shiran took an opportunity to move into the management side of software engineering. For the past two years, Shiran has managed Blackbaud’s Data Platform team and initiative, which is at the center of Blackbaud’s analytical transformation.
Shiran has been a volunteer with the TEALS program, a Microsoft Philanthropies program that connects classroom teachers with tech-industry volunteers to create sustainable CS programs, for the past 3 years.”
We present our solution for building an AI Architecture that gives engineering teams the ability to leverage data to drive insight and help our customers solve their problems. We started with siloed data: entities that each product described differently, inconsistent formats, complicated security and access schemes, and data spread over numerous locations and systems. We discuss how we created a Delta Lake to bring this data together, instrumented data ingestion from the various external data sources, and set up legal and security protocols for accessing the Delta Lake, and we go into detail about our method of conforming all the data to a Common Data Model using a metadata-driven pipeline.
This metadata-driven pipeline, or Configuration Driven Pipeline (CDP), uses Spark Structured Streaming to take change events from the ingested data and references a Data Catalog Service to obtain the mappings and transformations required to push the conformed data into the Common Data Model (CDM). The pipeline makes extensive use of the Spark API to perform the numerous types of transformations required to take these change events as they come in and UPSERT them into a Delta CDM. This model can take any set of relational databases (1000s in our case) and transform them into a CDM in a big data format (Delta Lake/Parquet) in a scalable, performant way, all from metadata. It can then perform schema-on-read to project from this CDM into any requested destination (database, filesystem, stream, etc.). This lets Data Scientists request data by specifying metadata; the pipeline then runs automatically, producing the schema they require with all data types conformed to a standard value and depositing it in their specified destination.
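To make the idea concrete, here is a minimal sketch of a configuration-driven flow in plain Python. It is not the pipeline described in the talk: an in-memory dictionary stands in for the Delta CDM, a list of dictionaries stands in for the stream of change events, and a hard-coded `CATALOG` stands in for the Data Catalog Service. Every name in it (`CATALOG`, `TRANSFORMS`, `apply_event`, `project`, the source tables, and the column names) is a hypothetical chosen for illustration.

```python
from datetime import date
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

# Hypothetical named transformations that catalog entries can reference.
TRANSFORMS: Dict[str, Callable[[Any], Any]] = {
    "identity": lambda v: v,
    "trim_upper": lambda v: v.strip().upper(),
    "iso_date": lambda v: date.fromisoformat(v),
}

# Hypothetical catalog metadata: two products describe the same entity
# differently, and both are mapped onto one CDM entity, "constituent".
CATALOG: Dict[str, Dict[str, Any]] = {
    "crm.donors": {
        "entity": "constituent",
        "key": "donor_id",
        "columns": {
            "nm": ("name", "trim_upper"),
            "joined": ("first_gift_date", "iso_date"),
        },
    },
    "fundraising.supporter": {
        "entity": "constituent",
        "key": "id",
        "columns": {
            "full_name": ("name", "trim_upper"),
            "since": ("first_gift_date", "iso_date"),
        },
    },
}


def apply_event(cdm: Dict[str, Dict[str, Row]], event: Row) -> None:
    """Conform one change event via catalog metadata and upsert it into the CDM."""
    spec = CATALOG[event["source"]]
    table = cdm.setdefault(spec["entity"], {})
    # Prefix the key with the source so ids from different systems cannot collide.
    key = f'{event["source"]}:{event["data"][spec["key"]]}'
    if event["op"] == "delete":
        table.pop(key, None)
        return
    conformed = {
        cdm_col: TRANSFORMS[transform](event["data"][src_col])
        for src_col, (cdm_col, transform) in spec["columns"].items()
        if src_col in event["data"]
    }
    # Upsert: update the row if the key exists, insert it otherwise.
    table[key] = {**table.get(key, {}), **conformed}


def project(cdm: Dict[str, Dict[str, Row]], entity: str, columns: List[str]) -> List[Row]:
    """Schema-on-read: return only the requested columns of a CDM entity."""
    return [{c: row.get(c) for c in columns} for row in cdm.get(entity, {}).values()]
```

Supporting a new source system in this sketch means adding a `CATALOG` entry rather than writing new code, which is the property the abstract attributes to a metadata-driven design. In a Spark setting, the upsert step would more likely be a Delta Lake merge applied to each micro-batch; that substitution is an assumption here, not something the abstract specifies.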