Charles is a Lead ML platforms engineer at MavenCode. He has well over 15 years of experience building large-scale, distributed applications. He has always been interested in building distributed systems. He has extensive experience working with Scala on Apache Spark
As the size of data generated grows exponentially in different industries such as Healthcare, Insurance, Financial Services, etc. A common challenging problem faced across this industry verticals is how to effectively or intelligently identify duplicate or similar entity profiles that may belong to the same entity in real life, but represented in the organization's datastore as different unique profiles. This could happen due to many reasons, from companies getting acquired or merging, to users creating multiple profiles or streaming data coming in from different marketing campaign channels. Organizations often wish to identify and deduplicate such entries or match up two records present in their datastore that are nearly identical (i.e. records that are fuzzy matches). This task presents an interesting challenge from the standpoint of computational complexity - with a very large dataset (> ~10 million) doing a brute force element-wise comparison will result in a quadratic complexity and is clearly not feasible from a resource and time perspective in most cases. As such, different approaches have been developed over the years including those that utilize (among others) regressions, machine learning, and statistical sampling. In this talk, we will discuss how we have used the Bayesian statistical sampling approach at scale to match records using a combination of KD-tree partitioning for efficient distribution of datasets across nodes in the Spark cluster, attribute similarity functions, and distributed computing on Spark.