This session continues the talk ‘How Australia’s National Health Services Directory (NHSD) Improved Data Quality, Reliability, and Integrity with Databricks Delta Lake and Structured Streaming’, given at Spark Summit 2019. We will present how the NHSD** implemented a ‘Federated Data Directory Platform’ that ingests data from multiple sources (such as authoritative systems of record and commercial vendors) and performs data operations such as validation, matching, merging, enrichment, and versioning, while generating and maintaining comprehensive data lineage, attribution, and provenance. The goal is to continually improve the data quality, governance, and completeness of Australia’s national directory of health services and practitioners. We will also cover how we currently ‘rank’ (promote or demote) input data sources based on manual audit outcomes, and how we intend to use machine learning to automatically classify preferred data sources when multiple sources compete to update the same data attributes. Finally, we will detail our architecture, built on Databricks Delta Lake and Spark Structured Streaming.
** Launched in 2012, the National Health Services Directory (NHSD) is a national directory of health services and the practitioners who provide them. This key piece of national digital health infrastructure was established by an Australian Health Ministers’ Advisory Council (AHMAC) agreement. It is jointly funded by Departments of Health within state and federal governments and managed by Healthdirect Australia.
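The source-ranking idea in the abstract (a preferred source wins when several sources compete to update the same attribute) can be illustrated with a minimal sketch. This is not the NHSD implementation, which the abstract says runs on Databricks Delta Lake and Spark Structured Streaming; it is plain Python, and the source names, rank values, and the `Update`, `merge_updates`, and `demote` names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Update:
    """One proposed attribute value, tagged with the source that supplied it."""
    source: str
    record_id: str
    attribute: str
    value: str


# Hypothetical ranking: a lower number means a more trusted source.
SOURCE_RANK: Dict[str, int] = {"system_of_record": 0, "vendor_a": 1, "vendor_b": 2}


def merge_updates(
    updates: Iterable[Update], ranks: Dict[str, int]
) -> Dict[Tuple[str, str], Update]:
    """Keep, per (record, attribute), the update from the best-ranked source.

    The winning Update is stored whole, so the chosen value keeps its
    attribution (which source it came from).
    """
    merged: Dict[Tuple[str, str], Update] = {}
    for update in updates:
        key = (update.record_id, update.attribute)
        current = merged.get(key)
        if current is None or ranks[update.source] < ranks[current.source]:
            merged[key] = update
    return merged


def demote(ranks: Dict[str, int], source: str) -> Dict[str, int]:
    """Return new ranks with `source` moved one position down (less trusted)."""
    ordered = sorted(ranks, key=ranks.get)
    i = ordered.index(source)
    if i < len(ordered) - 1:
        ordered[i], ordered[i + 1] = ordered[i + 1], ordered[i]
    return {name: position for position, name in enumerate(ordered)}


updates = [
    Update("vendor_a", "svc-1", "phone", "02 1111 1111"),
    Update("system_of_record", "svc-1", "phone", "02 2222 2222"),
    Update("vendor_b", "svc-1", "fax", "02 3333 3333"),
]
merged = merge_updates(updates, SOURCE_RANK)
```

In this sketch, a manual audit outcome would translate into a call such as `demote(SOURCE_RANK, "system_of_record")`, after which the same updates merge differently. An unknown source raises a `KeyError` here; a real pipeline would need an explicit policy for unranked sources.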
Mark Paul has 15 years of experience in large-scale software development. Having worked in frontend, backend, data engineering, and architecture roles, he has gained practical knowledge of how to build distributed software solutions that scale. He currently works for Healthdirect Australia (an Australian Government agency), solving complex data quality issues in the public health space.
Anshul Bajpai is an enthusiastic data engineering geek who currently works as Data Architect / Technical Lead at Healthdirect Australia (a public health sector company). He has 12+ years of overall IT experience across a variety of large enterprise systems in domains including oil and gas, pharmaceuticals, e-commerce, and travel. This includes 5+ years of extensive experience designing, prototyping, building, and deploying scalable data processing pipelines on distributed platforms using Scala, Spark, Databricks Delta Lake, Kafka, and the Hadoop ecosystem. He is passionate about solving complex problems in compute-intensive big data systems involving volume, variety, and velocity.