The abundance of data, together with regulations protecting people’s privacy, has created a need to protect private and personal information in a scalable and efficient way. Personal data includes sensitive and private information such as health records, banking transactions and frequently visited locations. One challenge of data anonymization is that as the anonymity of the data increases, its usefulness for analytics or research decreases. This talk presents a parallel implementation of the Top-Down Specialization algorithm for data anonymization using Apache Spark, which aims to balance data utility and data privacy. Performance is evaluated on large datasets of up to 20 million rows in a variety of cluster environments. The talk analyzes the speedups achieved with different data sizes. It also discusses changes made to the algorithm to improve performance, such as determining partition size and deciding what should run on the driver and what should run on the executors, and it covers scale-up experiments of the algorithm. A web page for the proposed topic, including slides, code and the research paper I wrote, is available here: micophilip.github.io/comp5704/
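To make the utility versus privacy trade-off concrete, here is a minimal single-machine sketch of the top-down idea in plain Python. It is not the Spark implementation from the talk. The taxonomy, the `specialize` function and the sample values are hypothetical, and the sketch accepts any specialization that keeps every group at size k or larger, whereas the published algorithm ranks candidate specializations with a score.

```python
from collections import Counter

# Hypothetical taxonomy for one quasi-identifier: parent -> children.
TAXONOMY = {
    "Any": ["Secondary", "University"],
    "Secondary": ["9th", "12th"],
    "University": ["Bachelors", "Masters"],
}


def leaves(node):
    """Return the leaf values covered by a taxonomy node."""
    children = TAXONOMY.get(node)
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves(child)]


def generalize(value, cut):
    """Map a raw leaf value to the node in the current cut that covers it."""
    for node in cut:
        if value in leaves(node):
            return node
    raise ValueError(f"{value!r} is not covered by the cut {cut}")


def specialize(values, k):
    """Start fully generalized and specialize while k-anonymity holds."""
    cut = ["Any"]
    changed = True
    while changed:
        changed = False
        for node in list(cut):
            children = TAXONOMY.get(node)
            if not children:
                continue  # already a leaf, nothing to specialize
            candidate = [n for n in cut if n != node] + children
            counts = Counter(generalize(v, candidate) for v in values)
            # Accept the specialization only if every group has at least k rows.
            if all(count >= k for count in counts.values()):
                cut = candidate
                changed = True
                break
    return sorted(cut)


values = ["9th"] * 3 + ["12th"] + ["Bachelors"] * 2 + ["Masters"] * 2
print(specialize(values, k=2))
```

A larger k keeps the values more general (more privacy, less detail), while a smaller k allows more specific values to survive. In a Spark version, the per-group counting is the kind of step that would plausibly run on the executors, with the decision about which specialization to apply made on the driver.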
I am a Senior Software Developer with 7 years of experience in software development and 4 years in team leadership positions. I am currently working towards my master's in Computer Science at Carleton University in Ottawa, Canada. I have been working with Spark for 4 years on predictive modelling and data privacy transformations. As part of my master's degree, I recently wrote a research paper on using Spark to de-identify datasets with the Top-Down Specialization technique. I have worked for D+H and IBM, and I currently work at IQVIA.