The Beauty of (Big) Data Privacy Engineering

Download Slides

Privacy engineering is an emerging discipline within the software and data engineering domains aiming to provide methodologies, tools, and techniques such that the engineered systems provide acceptable levels of privacy. In this talk, I will present our recent work on anonymization and privacy preserving analytics on large scale geo location datasets. In particular, the focus is on how to scale anonymization and geospatial analytics workloads with Spark, maximizing the performance by combining multi-dimensional spatial indexing with Spark in-memory computations.

In production, we have successfully achieved 1500+ times enhancements in terms of geo location anonymization, and 10+ times enhancements on nearest neighbour search based on anonymized geo datasets.

Speaker: Yangcheng Huang


– Hello everyone, the topic of this presentation is the beauty of big data privacy engineering. My name is Yangcheng Huang, I’m a director of software engineering with Truata. So before we move to the presentation, please allow me to do a quick introduction about Truata. Truata was a startup founded in 2018, by MasterCard and IBM. Truata is specialized in data privacy. So we’re basically offering this data anonymization solutions and services for different companies, across different industries, you know, with data processing platform, like Spark, it’s not so difficult to remove privacy information or any sensitive information from a big data set. But what Truata is specialized in is we can balance the trade off between data privacy and the data utility. That’s what we are good at. So, basically, we provide a number of consulting services, but also our big data platform to a number of different clients across different industries. We actually have they use spark in our data normalization and data analytics platform. So, the problem I want to discuss here is actually one of the very important challenge in data privacy. It’s called Geo privacy. In nowadays, everyone is so frequently with a mobile phone. So, our GPS information is actually shared and used by a number of companies, no matter you know it or not, you know, day in and day out. So, if you go online, if you search through the Geo privacy, you will find a number of interesting news, how you know, how people’s privacy is actually was introduced by different kind of GPS information leak. So, this particular program, we try to generalize the customers GPS, the customer GPS information and into a postcode level information. So that in a postcode level, so, there will be multiple customers in the same postcode so that you know, each person is not isolated based on their location information. So, the input of the problem essentially is a large data set. Is a study more than certain millions of customers GPS location information. And also the zip code information. The zip code information is basically each zip code is actually a shapefile. Each shapefile if you have experience with the geometry processing. Basically, each shapefile essentially is its shape, including a number of points, each point is actually a GPS point. So, with these two input, we try to mapping each customer’s GPS location information onto a zip code. So, that you know, as the final data set, we send for anything like advertising will be only on the date code level. Based on this, we need to further to calculate between the zip code we will try to calculate the nearest 10 zip code of so that we can infer people’s shopping behavior. So, for example, if we want to infer the shopping behavior of Ballymoss route where Truata is located in, we potentially look at the surrounding nearest to 10 zip code. So that’s basically the problem we try to solve here. Basically, the first thing we need to do a mapping, the second thing we need to do to really to calculate the nearest 10. So if you look at the problem here is essentially a trading problem, you know, it’s basically a million by meaning. So if you want to search, searching, so basically, we have 30 million customers, GPS information, you have to potentially search 1.7 million British zip code, so that we can do the mapping. So, you know, if you look at what Google is processing, like, you know, you’re doing 60 million web pages per day, 60 billion web pages per day, and Dublin population is 1.2 million. So it’s just like measuring the similarity of any two Dublin people. So this is the complexity of the problem we are facing, you know, the topic of this presentation is called you know, “The beauty of such a big data privacy engineering.” Beauty is defined as you know, the power of the quality, impersonal thing that gives the pleasure, you know, make people happy, something impressive and something good. So we can start the this showing people, showing the result of our showing you our processing result. So, our customers started the journey, using for example 200 of processing node, they started to using for example, Hive SQL, you know, we’re using a library from Esri, Hadoop Spatial Framework, if you google it you can easily find it. So, our customer use such a 200 node, they basically use Hive SQL, they basically try to do a mapping for 5 million customers onto this 1.7 million UK postcode. Is actually takes 33 hours to run the job. I mean, when Truata got this task, we basically started with, you know, a number of existing libraries, including use Spark SQL, frankly. We actually use, for example, GeoMesa, we use GeoSpark, and a number of different geospatial kind of this kind of spark library to do this mapping on 8 million, roughly 8 million customers to map these 8 million customers GPS information onto our 1.7 million postcode. Actually, the job runs 12 hours, still not finished. So I just kill the data because, you know, could be on the 12 hours is not acceptable in our SLA requirement. And basically, we the team take the job fix it, we begin to look at the problem, we basically rewriting the whole stuff in Spark Scala with tuning performance. Actually, at the end of the day, we can map 32 million subscribers onto a 1.7 million postcode within eight minutes. So basically, in the following slides are going to show you in particular more details of this work from two aspects. One is everything we used. The second one is you know, the journey we have gone through. But the result shows here is quite impressive, you know, we got for the mapping, we get 1500 times improvement. For the nearest 10 calculation I just mentioned, you know, the customer started, you know, with 200 nodes, but actually the time with the job is now finished. But Spark when we use Spark SQL, we can finish it initially with within 13 hours. And after we tune in the performance with proper spatial indexing everything we’re going to introduce right now, we can actually finish it in one hour and 10 minutes. So it’s 11 minutes 11 times improvement. So now I can introduce the algorithm and some implementation details of algorithm, you know, essentially we use R-tree as the fundamental algorithm to do this spatial indexing. R-tree is like, you know, a B-tree it’s a tree structure used for spatial access, spatial indexing, you know, they used to index multi-dimensional information, such as you know, GPS information, rectangles, polygons. The key idea of R-tree is to read to grouping the nearby objects and represent them into a bounding box. If you think about the problem we are facing just now, we try to mapping the people’s GPS information onto a zip code. So we are searching, you know, this kind of mesh between GPS and the zip code. Essentially, if the people in South London, there’s no way I should go to Costco I didn’t work to do this search. So there well I just need to search the postcode in London, right? So that’s basically the usability, why we use R-tree because you know it provided this bounding box and so, basically the search can be done in each smaller bounding box just basically you just need to search locally rather than search in whole space of the zip code. But, you know, a problem of R-tree is you know, the loading time right, so that people call the packing time. In this particular solution, we actually use the Sort-Tile-Recursive algorithm to speed up the loading. How it works is basically like what it shows here, in the bottom part of the slides. He started basically to look at the whole geo space, basically, they look at the maximum boundaries of the whole geo space we are looking at. Then after we have the maximum boundary, it began to do a partition, that’s where this Sort-Tile-Recursive function comes in. Is basically capable to partition the geo space into smaller ones, until it reaches a certain level, you know, the final, the smallest box curtains only at most, for example, nine points in this example. So it is widely used, but this is possibly the first time actually we started to use to do this our particular problem, you know, for example, there are a lot of other library like GeoMesa, like GeoSpark, they have been using this kind of to do spatial analysis, but their focus is mainly on query based focusing, if I have one GPS, I want to do a query to do some spatial analysis on this GPS data point. So that’s basically exiting algorithms for example. But you know, what we are doing is, we basically try to make a bunch of customers GPS information is 32 millions of GPS points, onto another nearly 2 million kind of postcode locations. So this kind of batch processing at this volume is continue, so. And the journey we have gone through is actually another part we want to introduce here. You know, it’s a combination of, you know, 99% of passion plus 1%, of skills, you know, the error we just introduced there, you know, the customer, the story start with, you know, the customer run with, if the, if, based on the customer runtime is potentially, you know, a few hundreds of hours of required to really to run this geo mapping function. Now, we took it, we ran it with Sparks SQL and when we actually, you know, it’s never finished here, so. But by looking at the nature of the problem, look at the logic, you know, he make four years of this low quality look low quality is actually quite essential. This is basically we can shrink the problem into a, you know, smaller scale and to solve the problem by applying this R-tree onto data frames, onto data RDD, we quickly can achieve the mapping in four minutes, it seems, you know, everyone is happy, it’s a wow factor. And the story seemed to end there. But in reality is not because we need to move this piece of information, I mean, move this algorithm into production. What does this mean? You know, move geos analytic a geos skill spatial artisan production means with which we started to introduce something like, you know, data pre processing, right. So, we need to pre process data, we need to read out the data into proper format. But then we found our algorithm actually increased it to from the runtime from four minutes encrypted to 10 minutes is still acceptable, right? It’s 10 minutes. But then we begin to look at the end-to-end problem right. We need to bring in more customer information. In addition to, you know, the TPS information we have, we also need to have, you know, other customer related information so that from the customers so that we can build end-to-end solution. After we introduce, you know, all this transactional information of the customer into this picture, we basically introduced a few joints, you know, with together with the postcode information, you know, the geometry and the postcode a string, this kind of mapping. Quite a number of joints is added into the picture. And then we found actually, our writing time increased sharply to five hours. This is expected or not expected, you know, that’s basically what each big data engineer is facing. But a lot of people actually stop there and began to, you know, this is as good as we can. But we want, our team is not stopping there, we keep looking at the problem, and we believe there should be a better achievement a better results rather than just stay with five hours. So we began to extensively tune in parameters to greedily you know, optimizing the performance. Actually, we found actually really not too much improvement. And then after we look at the partitioner, and the distribution of the data set, we found, wow, the reason is quite classical. You know, it’s basically, after we do a few joints, the data distribution of originally well distributed dataset is already distorted, you know, it’s a typical data skewness we are facing, and we begin to look at, you know, how data, essentially, we begin to look at the language of data distribution, you know, how the data distribution has been changed, before and after each is drawn. So, by looking at in depth into the problem, basically, we will begin to think about, we need to rewrite the transmissions. For each transmission, we basically remove all these unnecessary transmissions, you know, if we can use a subset of the Spark DataFrame, we directly use a subset rather than do more joints, basically, we remove unnecessary joints, remove unnecessary transformations, and we actually put the runtime back to roughly 14 minutes. So it’s end-to-end, which is, back to that kind of previous, you know, 10 minutes around that situation. And after we further optimize the code using, you know, in our previous code, we still use some Sparks SQL, not all of them, but there’s still some Sparks SQL there, we further optimize that part, we actually achieved the end-to-end run roughly at eight second. Now, that’s basically a typical journey of our big data privacy engineering. I mean, the algorithm is essential to achieve this target, but that’s in my view, that’s just the start of the whole journey, we basically just use our knowledge of Spark and keep optimizing, not just you know, about your discount with permit optimizing, we do then we begin to repartition the data, do a partitioning, create our own partitioner, you know, all this journey, if it’s not working a final state would begin to really rewriting some of the logic we found, that’s actually a good way of doing privacy of big data engineering in our day-to-day lives. So that’s potentially everybody getting in here has been experiencing from their day-to-day life, you know, it’s not just one particular algorithm will lead to a huge improvement on the performance. Is basically the combination of a good algorithm. And then we keep working on the algorithm, keep improving the solutions, and achieve the final target in our production. So and also, it’s good to mention that, you know, the SPARC forum where we were using when we’re doing this work is actually is 2.0 2.0.2. It’s not like you know, today we have 3.0 actually, I’m aware there are quite some optimization on the skewness. So, all this kind of stuff will make the work much easier today, but the journey we have down here, basically make us ready give us a lot of experience, how we optimize the performance of the work. So, this kind of journey is exactly this craftsmanship spirit here how things plays here, just like the gentleman is doing our hardware workshop, we actually working on Spark in similar way. We basically keep tweaking and tuning our performance, our tools, our platform, in such a way that, you know, we gradually make some near perfect situations in our day-to-day product and day-to-day solutions. So that’s basically our journey. Actually whilst the kind of type of journey of this big data privacy engineering, you know, in the slides basically introduced the first geo anonymization and geospatial analytics workload with Spark, using this R-tree algorithm to maximize performance, you know, and do this in-memory computations, which achieved the very good result. But also I showed the journey of production, such a workload and such a solution. It’s basically the craftsmanship spirit is really playing a key role here, by keep refining work and refining the solutions. So far, the work present here is mainly about GPS information, kind of anonymization. So we basically generate the GPS information onto zip code level is a typical way of doing automatic generalization. But in addition to this GPS information anonymization we do have some other bigger challenges. For example, even each GPS information, it’s anonymized the sequence of the this GPS information. That is the sequence that you may have the day police do a challenge there are a number of papers published on this field basically, the trajectory, even each GPS information is anonymized is still the people can still identify a person’s mobility based on the trajectory, the sequence of the day pull, the sequence of the zip code or sequence of the GPS information. So we actually published some patent in this field. Another ongoing work is now in for each trajectory, the trajectory can be combined together and build a mobility pattern. How we can anonymize this non mobility pattern, that’s another kind of work is ongoing, we try to make it a production build a whole set of solutions for this dream of privacy solution. So we try to publish this as a generic library, which can be used and in more in large scales, and also in more generic situations. That’s basically the main part of my presentation today. If you have any questions, here’s my contact. But any… Thank you for your attention. Any questions, feel free to give me a shout. Thank you.

Watch more Data + AI sessions here
Try Databricks for free
« back
About Yangcheng Huang


Yangcheng Huang is currently Director of Software Engineering, Data & Analytics at Truata. He is responsible for Truata's core platforms for risk assessment, anonymization and privacy preserving analytics, combining cutting-edge big data engineering, privacy-preserving machine learning algorithms with multi-cloud technologies. His interests include data privacy, big data analytics, data mining and machine learning. He has published 33 patents and 30+ peer reviewed research papers. Yangcheng holds a B.Eng. and a M.Eng. from Xi'an Jiao Tong University, China, and a PhD from University College London (UCL), UK.