Anshul Bajpai is an enthusiastic data engineering geek currently working as Data Architect / Technical Lead at Healthdirect Australia (a public health sector company). He has 12+ years of overall IT experience across a variety of large enterprise systems spanning Oil & Gas, Pharmaceuticals, E-commerce, and Travel, including 5+ years of extensive experience designing, prototyping, building, and deploying scalable data processing pipelines on distributed platforms using Scala, Spark, Databricks Delta Lake, Kafka, and the Hadoop ecosystem. He is passionate about solving complex problems in compute-intensive big data systems involving volume, variety, and velocity.
In continuation of the talk 'How Australia's National Health Services Directory (NHSD) Improved Data Quality, Reliability, and Integrity with Databricks Delta Lake and Structured Streaming', given by our Solution Architect at Spark Summit 2019, we present how the NHSD implemented a 'Federated Data Platform' that ingests data from multiple sources (such as authoritative systems of record and commercial vendors) and performs data operations such as validation, matching, merging, and versioning, while generating and maintaining comprehensive data lineage, attribution, and provenance in a quest to continually improve data quality, governance, and completeness. We will also cover how we currently 'rank' (promote/demote) input data sources based on manual audit outcomes, and how we intend to use machine learning to automatically classify preferred data sources when multiple sources compete to update the same data attributes. We intend to show code snippets demonstrating key features and functionality of our platform.
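As a flavour of the source-ranking idea described above, here is a minimal, illustrative Scala sketch. All names and rankings here are hypothetical placeholders, not the platform's actual code: it simply shows how, when several sources compete to update the same attribute, the value from the highest-ranked source could win.

```scala
// Hypothetical source rankings (lower number = more trusted),
// e.g. derived from manual audit outcomes.
val sourceRank: Map[String, Int] = Map(
  "authoritative-register" -> 1,
  "commercial-vendor-a"    -> 2,
  "commercial-vendor-b"    -> 3
)

// A proposed update to a single data attribute from one source.
case class AttributeUpdate(source: String, attribute: String, value: String)

// When multiple sources compete to update the same attribute,
// keep the value from the highest-ranked (lowest-numbered) source.
// Unknown sources rank last.
def resolve(updates: Seq[AttributeUpdate]): Map[String, String] =
  updates
    .groupBy(_.attribute)
    .map { case (attr, competing) =>
      val winner = competing.minBy(u => sourceRank.getOrElse(u.source, Int.MaxValue))
      attr -> winner.value
    }

val updates = Seq(
  AttributeUpdate("commercial-vendor-a", "phone", "02 1111 1111"),
  AttributeUpdate("authoritative-register", "phone", "02 2222 2222"),
  AttributeUpdate("commercial-vendor-b", "address", "1 Example St")
)

// "phone" is taken from the authoritative register, which outranks vendor A;
// "address" has no competition, so vendor B's value stands.
val resolved = resolve(updates)
```

In the real platform this precedence decision happens inside a Delta Lake merge across source feeds; the learned (machine-learning-based) classifier discussed in the talk would replace the static `sourceRank` map.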