Miao is an Engineering Manager at Adobe, where he works with a great team on platform engineering with Spark and other open source technologies. He was an active Spark contributor before moving into a management role. His interests span high-speed networks, data center infrastructure, data processing, and machine learning. Prior to joining Adobe, he worked at A10 Networks, IBM, and Alibaba in various engineering roles. Miao holds a Ph.D. in Computer Science from the University of Nebraska-Lincoln, with a focus on Peer-to-Peer (P2P) streaming.
In today's data-driven economy, companies collect ever more user data as a valuable asset. At the same time, users have rightfully raised concerns about how their data privacy is protected. In response, data privacy laws have been enacted to protect users' privacy, among which the General Data Protection Regulation (GDPR) of the European Union (EU) and the California Consumer Privacy Act (CCPA) are two representative laws regulating business conduct in their respective regions. A common requirement is to access or delete, in a timely manner, all records across all data ever collected that match a specific user's search key(s). The size of the collected data and the volume of search requests make enforcing GDPR and CCPA highly inefficient, if not infeasible in terms of resources.
In this talk, we demonstrate our work on enforcing GDPR and CCPA in Adobe's Experience Platform (AEP) by efficiently solving the search problem above. Specifically, we build Bloom filters while saving data in the Data Lake, with minimal resource and maintenance overhead, which reduces the search time for a single request by nearly 10X. Furthermore, we build orchestrated microservices that split extra-large search jobs into multiple smaller jobs and schedule them, balancing resource consumption against processing time. Finally, we walk through a few lessons learned from handling datasets with a large number of files and partitions in Spark.
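To illustrate why per-file Bloom filters can shorten a key search, here is a minimal sketch in plain Python. It is not the AEP implementation: the `BloomFilter` class, the `build_filters` and `files_to_scan` helpers, the file names, and the filter sizes are all hypothetical, chosen only to show the idea of building a filter at write time and consulting it before opening a file.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def build_filters(files):
    """Build one filter per file (hypothetical stand-in for the write path).

    files: mapping of file name -> iterable of user keys stored in that file.
    """
    filters = {}
    for name, keys in files.items():
        bloom = BloomFilter()
        for key in keys:
            bloom.add(key)
        filters[name] = bloom
    return filters


def files_to_scan(filters, user_key):
    """Return only the files whose filter says the key may be present."""
    return [name for name, bloom in filters.items() if bloom.might_contain(user_key)]


# Hypothetical data: three files and the user keys each one holds.
files = {
    "part-0000": ["user-1", "user-2"],
    "part-0001": ["user-3"],
    "part-0002": ["user-2", "user-4"],
}
filters = build_filters(files)
candidates = files_to_scan(filters, "user-2")
print(candidates)
```

A Bloom filter never reports a stored key as absent, so every file that actually holds the key is always in `candidates`. Files that do not hold it are skipped unless the filter returns a false positive, in which case the file is scanned unnecessarily but the result stays correct.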
As the usage of Apache Spark continues to ramp up within the industry, a major challenge has been scaling our development. Too often we find that developers re-implement a similar set of cross-cutting concerns, sprinkled with some variation of use-case-specific business logic, as a concrete Spark App. The consequences of this anti-pattern are significant. Cross-cutting logic is re-implemented again and again. Each isolated Spark App is responsible for its own resiliency, scalability, monitoring, and error handling. Attempting to weave together data as it flows across these Apps is highly inefficient. Pipelining data through one or more of these Apps requires multiple rounds of loading and saving data to disk, increasing the overall cost and risk of failure. In addition, there is no consolidated error handling when chaining multiple Spark Apps. In this talk, we will walk through the problems that led us to an extensible plugin framework, SIP, implemented to address these issues. SIP is used extensively in Adobe's Experience Platform (AEP) for data processing. The framework enables us to support a number of complex use cases by composing one or more simpler data conversion and/or validation operations. SIP is hosted internally, allowing a community of engineers to plug in code and benefit from the resiliency, scaling, and monitoring invested in existing infrastructure. Finally, we will dive deep into SIP's detailed error reporting and how it enables us to provide a much improved user experience to our customers.
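The idea of composing simple conversion and validation operations inside one host, with consolidated error reporting, can be sketched as follows. This is plain Python, not Spark and not SIP's actual API: the `plugin` decorator, the `PLUGINS` registry, the `Report` class, `run_pipeline`, and the two example operations are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical registry mapping a plugin name to its record-level operation.
PLUGINS = {}


def plugin(name):
    """Register a function as a named operation the host can compose."""
    def register(fn):
        PLUGINS[name] = fn
        return fn
    return register


@plugin("require_id")
def require_id(record):
    # Validation operation: reject records without an id.
    if not record.get("id"):
        raise ValueError("missing id")
    return record


@plugin("lowercase_email")
def lowercase_email(record):
    # Conversion operation: normalize the email field.
    return {**record, "email": record["email"].lower()}


@dataclass
class Report:
    rows: list = field(default_factory=list)
    errors: list = field(default_factory=list)


def run_pipeline(records, steps):
    """Apply the named steps in memory and collect all failures in one report."""
    report = Report()
    for index, record in enumerate(records):
        current = record
        try:
            for step in steps:
                current = PLUGINS[step](current)
        except Exception as exc:
            # One consolidated place records which row and which step failed.
            report.errors.append({"row": index, "step": step, "error": str(exc)})
            continue
        report.rows.append(current)
    return report


records = [
    {"id": "a1", "email": "A@Example.com"},
    {"id": "", "email": "B@Example.com"},
]
report = run_pipeline(records, ["require_id", "lowercase_email"])
print(report.rows)
print(report.errors)
```

Because the steps run back to back in memory, no intermediate result is saved and reloaded between them, and a failure in any step lands in the same report with its row and step name attached.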