Jun Ma is an engineer on the Data Ingestion team of Adobe Experience Platform. She has been working on building a distributed system that allows customers to ingest data in large volumes and facilitates data management. Her focus is on ensuring the efficiency, scalability, and resiliency of the system. Prior to joining Adobe, she obtained a master's degree in Information Technology from Carnegie Mellon University.
In today's data-driven economy, companies collect ever more user data as a valuable asset. At the same time, users have rightfully raised concerns about how their data privacy is protected. In response, data privacy laws have been enacted to protect users' privacy. The General Data Protection Regulation (GDPR) of the European Union (EU) and the California Consumer Privacy Act (CCPA) are two representative laws regulating business conduct in their respective regions. A common requirement is to access or delete, in a timely manner, all records across all data ever collected that match a specific user's search key(s). The size of the collected data and the volume of search requests make enforcing GDPR and CCPA highly inefficient, if not infeasible, with limited resources.
In this talk, we demonstrate our work on enforcing GDPR and CCPA in Adobe Experience Platform (AEP) by efficiently solving the search problem described above. Specifically, we build Bloom filters while saving data to the Data Lake, with minimal resource and maintenance overhead, which reduces the search time for a single search request by nearly 10X. Furthermore, we build orchestrated microservices that split extra-large search jobs into multiple smaller jobs and schedule them, balancing resource consumption against processing time. Finally, we walk through a few lessons learned from handling datasets with a large number of files and partitions in Spark.
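To illustrate why a per-file Bloom filter can cut search time, here is a minimal sketch in plain Python. It is not the AEP implementation: the class `BloomFilter`, the helpers `build_file_filters` and `candidate_files`, and the in-memory "file name to keys" layout are all hypothetical and chosen only for illustration. The idea is that a filter built when a file is written can later answer "this key is definitely not in this file", so a search only needs to scan the files whose filter reports a possible match.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items, fp_rate=0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        n = max(1, expected_items)
        self.num_bits = max(8, math.ceil(-n * math.log(fp_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def build_file_filters(files):
    """Build one filter per file at write time.

    `files` maps a file name to the user keys it holds (hypothetical layout).
    """
    filters = {}
    for name, keys in files.items():
        bf = BloomFilter(expected_items=len(keys))
        for key in keys:
            bf.add(key)
        filters[name] = bf
    return filters


def candidate_files(filters, search_key):
    """Return only the files that may contain `search_key`; the rest are skipped."""
    return [name for name, bf in filters.items() if bf.might_contain(search_key)]


if __name__ == "__main__":
    files = {"part-0": ["alice", "bob"], "part-1": ["carol"]}
    filters = build_file_filters(files)
    # Files listed here still need a real scan, since false positives are possible.
    print(candidate_files(filters, "alice"))
```

Because a Bloom filter never yields a false negative, skipping the files it rules out cannot miss a matching record; the cost is an occasional unnecessary scan of a file that turns out not to contain the key.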