Timo is a Product Manager and Architect at MavenCode. He has close to a decade of financial data modeling experience, working as both an analyst and a strategist in the energy commodities sector. At MavenCode, he now works closely with the engineering teams to solve interesting data modeling challenges.
As the volume of data generated grows exponentially in industries such as Healthcare, Insurance, and Financial Services, a common and challenging problem across these verticals is how to effectively and intelligently identify duplicate or similar profiles that belong to the same real-world entity but are represented in the organization's datastore as distinct profiles. This can happen for many reasons, from companies being acquired or merging, to users creating multiple profiles, to streaming data arriving from different marketing campaign channels. Organizations often wish to identify and deduplicate such entries, or to match up two records in their datastore that are nearly identical (i.e. records that are fuzzy matches). This task presents an interesting challenge from the standpoint of computational complexity: with a very large dataset (more than roughly 10 million records), a brute-force pairwise comparison has quadratic complexity, since n records require n(n-1)/2 comparisons, and is clearly not feasible from a resource and time perspective in most cases. As such, different approaches have been developed over the years, including those that use regressions, machine learning, and statistical sampling, among others. In this talk, we will discuss how we have used the Bayesian statistical sampling approach at scale to match records, using a combination of KD-tree partitioning for efficient distribution of datasets across nodes in the Spark cluster, attribute similarity functions, and distributed computing on Spark.
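To make the idea of partitioning before comparing concrete, here is a minimal single-machine sketch in plain Python. It is not the implementation presented in the talk: the record fields (`id`, `name`, `birth_year`, `zip`), the function names, the leaf size, and the 0.8 threshold are all hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for whatever attribute similarity functions are actually used. The Bayesian scoring step and the Spark distribution are left out entirely. The sketch splits records KD-tree style at the median of alternating numeric attributes and only compares records that land in the same leaf.

```python
from difflib import SequenceMatcher
from itertools import combinations


def kd_partition(records, keys, leaf_size, depth=0):
    """Split records KD-tree style: at each level, sort by one numeric key
    (cycling through `keys`) and cut at the median until leaves are small."""
    if len(records) <= leaf_size:
        return [records]
    key = keys[depth % len(keys)]
    ordered = sorted(records, key=lambda r: r[key])
    mid = len(ordered) // 2
    return (kd_partition(ordered[:mid], keys, leaf_size, depth + 1)
            + kd_partition(ordered[mid:], keys, leaf_size, depth + 1))


def name_similarity(a, b):
    """Hypothetical attribute similarity: case-insensitive string ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_matches(records, keys, leaf_size=3, threshold=0.8):
    """Compare records only within the same leaf; return matches and comparison count."""
    matches, comparisons = [], 0
    for leaf in kd_partition(records, keys, leaf_size):
        for a, b in combinations(leaf, 2):
            comparisons += 1
            if name_similarity(a["name"], b["name"]) >= threshold:
                matches.append((a["id"], b["id"]))
    return matches, comparisons


# Made-up sample data, for illustration only.
records = [
    {"id": 1, "name": "Jon Smith", "birth_year": 1980, "zip": 60601},
    {"id": 2, "name": "John Smith", "birth_year": 1980, "zip": 60601},
    {"id": 3, "name": "Maria Garcia", "birth_year": 1992, "zip": 10001},
    {"id": 4, "name": "Maria Garica", "birth_year": 1992, "zip": 10001},
    {"id": 5, "name": "Wei Chen", "birth_year": 1975, "zip": 94105},
    {"id": 6, "name": "Ahmed Khan", "birth_year": 1988, "zip": 73301},
]

matches, comparisons = candidate_matches(records, keys=["birth_year", "zip"])
brute_force = len(records) * (len(records) - 1) // 2

print("matches:", matches)
print("comparisons within leaves:", comparisons, "vs brute force:", brute_force)
```

On this toy input the two leaves need 6 comparisons instead of the 15 a brute-force pass would make. One known limitation of this simple scheme is that a median cut can separate two near-duplicates that fall on either side of a boundary, so a real system would need some way to handle records near partition edges. In a distributed setting, each leaf could plausibly be processed as its own partition, though how the talk maps leaves to Spark partitions is not stated in the abstract.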