Mark Paul has 15 years of experience in large-scale software development. Having worked in frontend, backend, data engineering, and architecture roles, he has gained practical knowledge of how to build distributed software solutions that scale. He currently works for HealthDirect (an Australian Government agency), solving complex data quality issues in the public health space.
This talk continues 'How Australia's National Health Services Directory (NHSD) Improved Data Quality, Reliability, and Integrity with Databricks Delta Lake and Structured Streaming', given by our Solution Architect at Spark Summit 2019. We hope to present how the NHSD implemented a 'Federated Data Platform' that ingests data from multiple sources (such as authoritative systems of record and commercial vendors) and performs data operations such as validation, matching, merging, and versioning, while generating and maintaining comprehensive data lineage, attribution, and provenance in order to continually improve data quality, governance, and completeness. We will also cover how we currently 'rank' (promote or demote) input data sources based on manual audit outcomes, and how we intend to use machine learning to automatically classify the preferred data source when multiple sources compete to update the same data attributes. We intend to show code snippets to demonstrate key features and functionality of our platform.
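To make the ranking idea concrete, here is a minimal plain-Python sketch of rank-based merging with per-attribute provenance. It is an illustration only and not the NHSD implementation: the names `SourceRecord`, `apply_audit`, and `merge_by_rank`, the integer scoring scheme (plus or minus one per audit), and the sample source names and values are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class SourceRecord:
    """One source's view of an entity (hypothetical structure)."""
    source: str                 # name of the contributing data source
    attributes: Dict[str, Any]  # attribute name -> value (None means missing)


def apply_audit(scores: Dict[str, int], source: str, passed: bool) -> Dict[str, int]:
    """Return new scores with `source` promoted (+1) or demoted (-1).

    The +/-1 step is an assumed scheme, chosen only to keep the sketch simple.
    """
    updated = dict(scores)
    updated[source] = updated.get(source, 0) + (1 if passed else -1)
    return updated


def merge_by_rank(
    records: List[SourceRecord], scores: Dict[str, int]
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Merge records attribute by attribute, preferring higher-scored sources.

    Returns the merged attributes and a provenance map recording which
    source supplied each attribute.
    """
    merged: Dict[str, Any] = {}
    provenance: Dict[str, str] = {}
    # Visit the most trusted source first; unknown sources score 0.
    ordered = sorted(records, key=lambda r: scores.get(r.source, 0), reverse=True)
    for record in ordered:
        for name, value in record.attributes.items():
            if value is None or name in merged:
                continue  # skip missing values and attributes already filled
            merged[name] = value
            provenance[name] = record.source
    return merged, provenance


# Made-up example data: two sources compete to set the same attributes.
scores = {"system_of_record": 2, "vendor_a": 1}
records = [
    SourceRecord("vendor_a", {"phone": "02 5550 0100", "fax": "02 5550 0101"}),
    SourceRecord("system_of_record", {"phone": "02 5550 0199", "fax": None}),
]
merged, provenance = merge_by_rank(records, scores)

# Two failed audits demote the system of record below the vendor.
demoted = apply_audit(apply_audit(scores, "system_of_record", False),
                      "system_of_record", False)
merged_after, provenance_after = merge_by_rank(records, demoted)
```

In this sketch the higher-scored source wins `phone`, while `fax` falls through to the vendor because the preferred source has no value for it. After the two failed audits, the vendor supplies both attributes. A machine-learned classifier, as mentioned above, would replace the hand-maintained `scores` dictionary.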