Responsible AI: Protecting Privacy and Preserving Confidentiality in Machine Learning and Data Analytics

In this session, we will discuss a range of new emerging technologies for privacy and confidentiality in machine learning and data analytics. We will demonstrate how to use open source tools to put these technologies to work for your applications.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Hi everyone, I’m Sarah Bird from Azure AI. And today I’m gonna be telling you about some of our responsible AI capabilities on Azure. When we think about developing AI responsibly, there are many different activities that we need to think about, and it’s important that we have tools and technologies that help support us in each of these. So in Azure we’re developing a range of different tools to support us on this journey. In this session, I’m gonna be talking about technologies that help protect people, preserve privacy, and enable you to do machine learning confidentially. However, we also have capabilities around gaining a deeper understanding of your model through techniques like interpretability and fairness. And we’re bringing new capabilities to our platforms, as well as recommending practices, that allow you to have complete control over your end-to-end machine learning process, to make it reproducible, repeatable, auditable, and a process that has the right human oversight built in. A lot of these responsible AI capabilities are new, and they are actively being developed by the research community and in practice. At the same time, we’re already creating AI, and we’re already running into many of these challenges in practice as we do that. And so we felt that it was essential to get tools and technologies into the hands of practitioners as soon as possible, even if the state of the art is still evolving. In order to do this, we have been developing a lot of our capabilities as open source libraries, because this enables us to directly co-develop with the research community. It enables people to easily extend a capability and make it work for their platform. And it allows us to iterate more rapidly and more transparently. However, end-to-end machine learning is often best supported, particularly in production, by a platform that helps you track the process and make it reproducible.
And so, in order to make it easier for people to use these capabilities in that end-to-end process, we integrate them into Azure Machine Learning, so they can just be part of your machine learning life cycle. So if we wanna dive in and talk about today’s topic, which is how we really protect people and the data that represents them, one of the big questions that we need to think about is: how do I protect privacy while using data? And you might immediately think that you have great answers to this. Of course, the simplest thing we can do is make sure that we have excellent access control and that there isn’t unnecessary access to the data set. And the next step beyond that is to actually anonymize values in the data, so that the people who can look at the data set still can’t see private information about individuals. However, it turns out that while these are important steps, they’re not enough. Even when we use data this way, the output of the computation, the model that we’ve built or the statistics, can still end up revealing private information about individuals. As a machine learning example of this, I could be building a machine learning model that helps autocomplete email sentences. In the data set, there could be some very rare sentences like “my social security number is…”. And so when I type that as a user, it might be that the model matches it against that single or small number of examples, and actually autocompletes my sentence with a social security number from the data set, which could be a significant privacy violation. So there are cases where models can memorize individual data points, and we definitely need to consider that.
But even if we think, okay, that’s a problem specific to machine learning, there are also challenges when we look specifically at statistics and aggregate information overall. So let me jump over to a demo to show you what I mean here. In this case, I am going to be demoing in Azure Databricks, but you could be using a Jupyter notebook or your Python environment of choice.

And what I’m gonna do here is demonstrate how I can take a data set. In this case, my data set is for a loan scenario.

So I’m going to be looking at people’s incomes, and trying to use that as a feature to decide whether or not to offer them a loan. And what I wanna demonstrate is that we can actually reconstruct a lot of the private information in the underlying data set just from the aggregate information that we were using. So in this case, I’m going to assume that I know the aggregate distribution of incomes for different individuals, as represented by this chart here. And what I wanna show is that I can take this information, combine it with a little bit of additional information about two individuals, in this case these values here, and then use an off-the-shelf SAT solver. Here I’m using Z3, which you can just pip install. And I’m going to use that SAT solver to reconstruct a data set that’s consistent with the published information that we know. So if I run this SAT solver, what we’re going to see is that we are able to reconstruct a data set that is consistent with all of the aggregate information as well as the individual information that we know. And in this case, we can actually compare that against the real data set, since we know the real information, and see how well our attacker is doing here. So if we look at the chart, what we can see is that for almost 10% of people, we were able to exactly reconstruct their income. And if we widen the range a little bit and say, okay, let’s look within $5,000 for example, then we’re actually able to get that correct for more than 20% of people. Now in this case of course, the attacker wouldn’t know which 20% it has correct. Although in reality, what you could do is run this attack many different times and start looking at the distribution of possible values. And so, with more computational power, it is still possible to get more information.
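A toy version of this reconstruction, with invented numbers, can be run with nothing but the standard library. It brute-forces the same kind of constraints that the demo hands to Z3; for realistic data set sizes you would `pip install z3-solver` and express the constraints to the solver instead:

```python
from itertools import product

buckets = [30, 40, 50]        # possible incomes (in $1000s) for each person
n_people = 4
published_total = 160         # published aggregate: the sum of all incomes
known = {0: 30}               # side information: person 0 earns $30k

# Enumerate every data set consistent with the published information.
solutions = [
    combo for combo in product(buckets, repeat=n_people)
    if sum(combo) == published_total
    and all(combo[i] == v for i, v in known.items())
]
print(solutions)   # the attacker narrows 81 possible data sets down to 6
```

Even this tiny example shows the core idea: each published aggregate and each piece of side information eliminates candidate data sets, and with enough of them the attacker converges on the real one.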
And so now the question is, what exactly do we do about attacks like these?

And so this is where a really exciting technology comes in: differential privacy. Differential privacy starts out as a mathematical definition that says, if implemented correctly, you get a statistical guarantee that you won’t be able to detect the contribution of any individual row in the data set in the output of the computation. And that lines up exactly with saying: now you can’t do this type of reconstruction attack. So it enables us to guarantee that we can hide the contribution of the individual and have a much stronger privacy guarantee. As I mentioned, differential privacy is originally a mathematical definition, and since the publishing of that idea, the research community has developed many different algorithms that successfully implement it in different cases so that you can apply it. And it works through two mechanisms. Say I want to do some aggregate computation, whether that’s statistics or building a machine learning model. The first thing I need to do is add noise. The statistical noise hides the contribution of the individual so you can’t easily detect it.
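For reference, the guarantee being described has a precise form. A randomized computation M is epsilon-differentially private if, for any two data sets D and D′ that differ in a single row, and any set of possible outputs S:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

In words: adding or removing any one person’s row changes the probability of every possible output by at most a factor of e^epsilon, so the output cannot depend strongly on any individual.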

And the idea here is that in most cases, particularly for this aggregate information, we should be able to add an amount of noise significant enough to get the privacy guarantee that you can’t detect the individual, but small enough that it’s a small amount of noise on the overall aggregate, so that the aggregate is still useful for our computation. The second piece is that, if you could do many queries, or the right type of queries, then you might still be able to detect the individual information. And so we need to calculate how much information was revealed by the computation, and then subtract that from an overall privacy loss budget. The combination of these two capabilities enables us to have this much stronger privacy guarantee. As I mentioned, this is a very active area of research, and there are many different algorithms that have been developed by the research community to implement this concept in practice, depending on your particular setting. And so we wanted to make this easy for people to use without requiring you to be an expert in differential privacy, because I think it is such a promising capability, but on the other hand, it involves quite a range of algorithms to implement. So we partnered with researchers at Harvard to develop an open source platform that enables you to easily put differential privacy into your machine learning and data analytics applications. The platform sits between the user, the query you want to do, and your data set or data store. When you query through the system, it will add the correct amount of noise based on the query and your privacy budget, subtract the information revealed from the budget store, and allow you to track the budget. And then it will give you back your aggregate results, but with that differentially private noise added. So now you have the privacy guarantee, and you can go forward and use that.
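The noise-addition step can be sketched with the Laplace mechanism, the classic building block for numeric queries. This is a minimal illustration with invented numbers; a real system also has to get the sensitivity analysis and budget accounting right:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    # Noise scale grows with the query's sensitivity and shrinks as more
    # privacy budget (epsilon) is spent on this single release.
    return true_value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# A counting query changes by at most 1 when one person is added or removed,
# so its sensitivity is 1. Smaller epsilon -> more noise, stronger privacy.
noisy_count = laplace_mechanism(true_value=120, sensitivity=1.0, epsilon=0.5)
```

Because the noise is centered at zero, a single noisy count is close to the truth on average, yet any one person’s presence or absence is statistically hidden.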
So let’s go back to the demo to see what this looks like in practice. In this case, I’m going to use our open source system, and we’re going to add the differentially private noise to that income distribution. Here you can actually see the epsilon, where I’m giving it a budget to do that. And so now we’ve added noise to our histogram. So let’s check that privacy guarantee. In this case I’m gonna redo my same SAT solver attack, but I’ll actually see a really different result, which is unsatisfiable. So this is great. It means that we have successfully protected privacy, at least against this type of attack. And the great thing about differential privacy is that there are mathematical proofs behind it, so we also know that, if implemented correctly, we’re protecting privacy against a variety of other attacks besides the specific one that I’m demonstrating here. However, it’s not enough to just protect privacy, because we could also do that by just not using the data, or not having the data at all, right? So the second thing we need to investigate is how well we’re actually doing in terms of that noise. So let’s run this and compare. Here’s a comparison of the non-private and private information, and you can see that overall we’re doing pretty well. However, it is a small data set, so there are particular cases where you can definitely see the noise that’s being added. That might be fine for my problem; it might be that I can tolerate a fair amount of noise. Or it could be that in this case I want to give the query more budget to use so that less noise is added, or that I want to use a larger data set or different aggregate functions that let me sit on a different point in the privacy-accuracy trade-off curve. So there are a lot of options here. It’s not a fixed amount of noise that has to be added.
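The budget bookkeeping mentioned here can be sketched as a toy ledger using basic sequential composition, where the epsilons of successive releases simply add up. Real accountants in systems like this use much tighter composition results, and the class name below is invented for illustration:

```python
class PrivacyBudget:
    """Toy privacy-loss ledger: refuse queries once the budget is spent."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted; refusing the query")
        self.remaining -= epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.5)          # first noisy release
budget.spend(0.5)          # second noisy release
print(budget.remaining)    # -> 0.0; any further query now raises
```

The key design point is that the ledger sits in front of the data: once the budget is gone, no further queries are answered, which is what bounds the total information an attacker can accumulate.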
Then, if we wanna look at one of the ways that we can use differential privacy in machine learning: what I’m demonstrating here is that with our open source system, you can actually generate synthetic data. So in this case, we’re using differential privacy to create a data set, here’s where we’re giving the budget, that lines up with the overall trends and patterns that we wanna see in the data set at the aggregate level, but actually hides the contribution of the individual, as we discussed. And so, I can use the system to generate my data set here, and then I can take that and go and do machine learning. However, this isn’t the only option.
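One simple way to picture DP synthetic data is histogram-based resampling: add calibrated noise to the bucket counts, then draw new rows from the noisy histogram. This is a toy sketch with invented parameters, not the synthesizer used in the demo:

```python
import numpy as np

def dp_synthetic(values, bins, epsilon, n_samples, seed=None):
    rng = np.random.default_rng(seed)
    counts, edges = np.histogram(values, bins=bins)
    # Each person changes exactly one bucket count by 1, so sensitivity is 1.
    noisy = counts + rng.laplace(0.0, 1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0, None)          # counts cannot be negative
    chosen = rng.choice(len(counts), size=n_samples, p=noisy / noisy.sum())
    # Draw each synthetic value uniformly within its chosen bucket.
    return rng.uniform(edges[chosen], edges[chosen + 1])

incomes = np.random.default_rng(0).normal(60_000, 15_000, size=500)
synthetic = dp_synthetic(incomes, bins=20, epsilon=1.0, n_samples=500)
```

The synthetic rows follow the aggregate shape of the noisy histogram, so downstream training sees realistic distributions while no real individual's value is released.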

I also could instead use differential privacy directly in the machine learning optimizer, so that as I’m pulling in the training data and feeding it into the model, I’m calculating the budget that I’m spending. So there are multiple options for how you might wanna use differential privacy in machine learning. But for statistics, it’s a bit more straightforward, where we can just go and add it on top of the aggregate results. So with that, I do wanna mention that this project is part of a larger initiative.
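The optimizer option can be sketched as the core DP-SGD step: clip each per-example gradient, then add noise to the batch sum. This is an illustrative numpy-only sketch with invented hyperparameters; real training would use a DP library that also tracks the epsilon spent across steps:

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.1, rng=None):
    rng = rng or np.random.default_rng()
    # 1. Clip each per-example gradient so no one person has outsized influence.
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    # 2. Add Gaussian noise calibrated to the clipping norm to the batch sum.
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        0.0, noise_multiplier * clip_norm, size=params.shape)
    # 3. Take an ordinary SGD step with the privatized average gradient.
    return params - lr * noisy_sum / len(per_example_grads)

params = dp_sgd_step(np.zeros(3),
                     [np.array([3.0, 0.0, 0.0]), np.array([0.1, 0.2, 0.1])])
```

Clipping bounds any individual's contribution to the update, which is exactly the sensitivity bound the added noise is calibrated against.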


So this is really just the first system in that initiative, where what we want to do is build the software in the open source, and use this as a way both to allow more people to adopt it and to advance the state of the art. The more places where we can collaborate, really try it, figure out where it doesn’t work, and then advance the research and iterate, the farther we can take this technology, and open source really enables us to do that. The other thing is, because these are difficult algorithms to implement, it’s great to have them in the open, because we can have experts inspect them, we can verify them, and we can develop tools that verify them. So it’s a really great place for us to create a community around this technology. The other thing I wanna mention is that when many people see this, they immediately think about the data they already have and ask: why do I wanna add noise? That seems like a strictly worse experience. And you might still wanna do that, because privacy is important; it might be worth a little bit of noise to have that privacy guarantee. In other cases, because we’re changing the trade-off curve here, you can actually use data that you might not otherwise have been able to use. Because the baseline requirement for even using the data is that there’s a privacy guarantee, there’s more data available that we can now happily use for the good of society, for the good of people, while not risking privacy. And so, we really believe in this initiative as a way for us to use more data on the problems that are important to people and society, without having to make a hard trade-off between using the data to solve those important problems and protecting people’s privacy.

So with that, I’m going to switch and talk about another family of technologies that can work in combination with differential privacy. These are called confidential machine learning. And as I mentioned, it’s actually a family of technologies, largely united around the theme of confidentiality: enabling you to do machine learning or computation in a way that’s confidential. However, they have different trade-offs and they can be used in different ways. And so, the first and easiest one to understand,

is having technology that’s confidential from the data scientist. You can imagine cases where I want to be able to design my model and code up what it will be, but I don’t want the data scientist to be able to look directly at the data. I want them to be able to train a model on that data, but not to directly inspect it. So that’s the simplest type of confidential computing, and we have that capability built in. It enables you to have the guarantee that the data scientist can’t see the data. Now, if you want to go a step farther, you can do confidential computing, or confidential machine learning, using encryption that is powered by hardware. In this case, there’s a hardware unit called a trusted execution environment that runs inside the CPU, and all of the computation stays encrypted. So this really completes the encryption lifecycle. Today, data is encrypted at rest, and it’s encrypted over the network. But when it gets inside the CPU, you actually have to decrypt it, and so you do have that data exposed to the operating system and the CPU, and you have to trust them. What this does instead is let you keep the data and the computation encrypted inside the CPU. So now you don’t have to trust the operating system. You don’t have to trust the cloud. Only inside of this trusted execution environment will you actually decrypt the computation and execute it, and then you re-encrypt before you put out the results. And so this enables us to build machine learning models on encrypted data and produce an encrypted model. Or the same thing with inferencing, where now I can have a model inside of that enclave, and I can send encrypted inferencing requests and get the response back. And I have a much smaller trust boundary in terms of what I need to think about. The other thing that’s interesting about this technology is that it enables multi-party scenarios, where now each of us only needs to trust the hardware unit, and we can collaborate on building a machine learning model without having to expose our data to each other, and without having to trust each other. So there’s a lot of interesting things that we can do with this hardware-based technology. And the thing that’s great about it is that everything runs inside the enclave.

Because it’s running unencrypted inside the enclave, you can run a large amount of computation and different types of computation, so it works in a lot of cases. However, you do have to trust that hardware unit, and you have to have special hardware. And so in some cases, we wanna go a step further. Perhaps you are a steward of someone else’s data, and you actually have a mandate to minimize the number of times it’s decrypted. It might be that you want to look at ways to do the computation while leaving the data completely encrypted. And so that’s where homomorphic encryption comes in. The idea of homomorphic encryption is that I’m going to leave the data completely encrypted, and I’m going to use math that allows me to operate on the encrypted values. For certain types of computation, we can do things like: encrypted one plus encrypted two results in encrypted three. This enables us to perform the computation without decrypting anything, and so you don’t have to trust the hardware in the way that you did with the previous technology. We have an open source library called Microsoft SEAL that contains state-of-the-art homomorphic encryption algorithms, and it can be used to implement a variety of homomorphic encryption scenarios. But I’m gonna jump over and demonstrate how we can use this for machine learning. So in this case, I’m going to set this up. And so, what I’ve done is

I’ve trained a model and my model is sitting in my model registry in Azure Machine Learning. And so I’m gonna download my model from my model registry.

And then I’m going to set this up in the cloud for inferencing, so we can actually host it in the cloud and then send inferencing requests to it. But I want to do this using homomorphic encryption. And so, the difference is that when I create my scoring file to do inferencing with my model, I’m gonna use SEAL. So I’m gonna use my encrypted inferencing server, and I’m going to build that into my scoring file so that now I can do my inferencing with homomorphic encryption. Then I just use Azure Machine Learning here, and I’m gonna deploy my model using Azure Container Instances. So now the model is set up in the cloud, and we can actually move forward and start calling it. So here, let’s test this service and

what I’ve done here is generate the keys. So I have my public and private key; I keep the private key locally, and I’ve already put the public key in the cloud. Now let’s send a call to it. These are the features that I wanna send: I wanna know if this person would be accepted for a loan. However, I want to send that to the cloud encrypted. So I’m going to send that, and this is the value, encrypted, that I’m sending into the cloud.

And we’re going to look at the response here.

So this is the response we received, also encrypted, so that it’s hard for the cloud, or an attacker, or anyone to know what they’re seeing here. And now if we actually decrypt the results, we can see the information. We get the prediction: in this case, this individual would be denied a loan. So this is great. It doesn’t work for all model types, but it does enable someone to actually use a model without sharing their data with the model, or with the platform that’s hosting the model. So this is really a higher level of confidentiality.
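The whole flow can be sketched with a toy additively homomorphic scheme (textbook Paillier with deliberately tiny keys, where multiplying ciphertexts adds the plaintexts, so Enc(1)·Enc(2) decrypts to 3). This is not the scheme SEAL implements and is far too weak for real use; the features and weights are invented. The client encrypts its features, the server scores an integer linear model on ciphertexts only, and the client decrypts the score and applies the threshold:

```python
import math
import random

# --- Toy Paillier setup (illustration only; real keys use 1024+ bit primes) ---
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)   # Carmichael's lambda(n)
mu = pow(lam, -1, n)           # precomputed constant used in decryption

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:                 # r must be invertible mod n
        r = random.randrange(1, n)
    # Enc(m) = (1 + n)^m * r^n mod n^2
    return pow(1 + n, m % n, n2) * pow(r, n, n2) % n2

def decrypt(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

# --- Client: encrypt the loan applicant's integer features ---
features = [52, 7, 1]                    # invented feature values
enc_features = [encrypt(x) for x in features]

# --- Server: score an integer linear model on ciphertexts only ---
weights = [2, 5, -30]                    # invented model weights
enc_score = 1                            # a valid encryption of 0
for c, w in zip(enc_features, weights):
    enc_score = enc_score * pow(c, w % n, n2) % n2   # Enc(x)^w = Enc(w*x)

# --- Client: decrypt the score and apply the decision threshold ---
score = decrypt(enc_score)
if score > n // 2:                       # map from mod-n back to a signed value
    score -= n
print("approve" if score > 0 else "deny")   # score = 2*52 + 5*7 - 30*1 = 109
```

Note that the server never sees the features, the score, or the secret key; it only manipulates ciphertexts, which is exactly the confidentiality property the demo is showing.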

So with that, I’m going to wrap up here and say: responsible AI is a big topic. We have a lot of great resources, both on what we were talking about today, which is privacy and confidentiality, as well as on some of the other topics I mentioned briefly. You can go to the responsible AI Resource Center, where we have a lot of different practices to help your organization get started in responsible AI development, as well as links to all the tools and capabilities that we just mentioned. I also mentioned that these work either built into Azure Machine Learning or on top of it, so you can go directly to AML and learn more about how to use our responsible AI capabilities there. And if you’re interested in getting involved in OpenDP or using our open source differential privacy system, you can check out the OpenDP community or join us on GitHub. We would love to have collaborators and contributions for the system. The same goes for Microsoft SEAL, which is available on GitHub; we’d love to have collaborators and contributors there as well. So, I hope that this was a great session for you and that you use these tools.

About Sarah Bird


Sarah leads research and emerging technology strategy for Azure AI. Sarah works to accelerate the adoption and impact of AI by bringing together the latest innovations in research with the best of open source and product expertise to create new tools and technologies. Sarah is currently leading the development of responsible AI tools in Azure Machine Learning. She is also an active member of the Microsoft AETHER committee, where she works to develop and drive company-wide adoption of responsible AI principles, best practices, and technologies. Sarah was one of the founding researchers in the Microsoft FATE research group, and prior to joining Microsoft she worked on AI fairness at Facebook. Sarah is an active contributor to the open source ecosystem; she co-founded ONNX, Fairlearn, and SmartNoise, and was a leader in the PyTorch 1.0 and InterpretML projects. She was an early member of the machine learning systems research community and has been active in growing and forming the community. She co-founded the MLSys research conference and the Learning Systems workshops. She has a Ph.D. in computer science from UC Berkeley, advised by Dave Patterson, Krste Asanovic, and Burton Smith.