Encryption and Masking for Sensitive Apache Spark Analytics: Addressing CCPA and Governance


Data protection requires a balance between encryption and analytics: encrypted data is protected, but encryption limits its analytic value. In this session we will present a variety of practical methods for encrypting data (as required for HIPAA, CCPA, PCI and cross-border data controls) while maintaining its analytics power. These methods include encryption at rest and in use, dynamic masking, selective value searches of encrypted columns, and column/row-level filtering (for the ‘Right of Erasure’).

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– So welcome, everyone, to our Spark + AI Summit session on Encryption and Masking of Sensitive Data for Spark Analytics, with a particular focus on California Consumer Privacy Act compliance and governance.

So presenting today,

we’re gonna have myself, Les McMonagle, Chief Security Strategist at SecuPi, with over 25 years’ experience in information security, data privacy, and regulatory compliance. And we’ll have a demo by the company’s CEO and founder, Alon Rosenthal, who has also been in the information security space for almost 20 years.

He had his own company ActiveBase

that he built up from scratch, where he invented dynamic data masking, and sold to Informatica, where it became their DDM product. He then started SecuPi about five years ago with the idea of building what he felt was the next-generation security solution.

So on the agenda, then, first we’ll be talking about achieving that balance between data protection and analytics value. You’ve got two very competing forces opposing each other, and we’ll spend a lot of time talking about how to achieve that balance and satisfy both requirements satisfactorily. And then we’ll look a little bit at the difference between Hold Your Own Key encryption versus Bring Your Own Key or Cloud-provided encryption.

Plethora of Data Privacy Regulations

Yeah, so first, there’s a plethora of different data privacy regulations around the world. What you see in the map here is just a representative sample of some of those regulations. Clearly we’re gonna talk specifically about the California Consumer Privacy Act in this particular session, but the one common thread you’re gonna see across all of these is that they’re all generally converging on an internationally accepted set of standard privacy principles. Things like: you only collect the data that you need to provide the service you’ve contracted or offered to your customers, consumers, or data subjects; you only keep the data for as long as necessary to provide that service; you honor data subjects’ rights to have their data deleted or removed when they want; and you manage their consent and preferences, honoring what they’ve opted into and what they’ve opted out of. These kinds of privacy principles are gonna apply everywhere. And if you really keep those in mind, go into this with privacy by design from the beginning, and build privacy compliance into each new initiative, it’ll be a lot more streamlined. You’ll have much better compliance with a lot less effort, and you’ll do it in a more cost-effective way than you would if you tried to bolt it on afterwards.

Common Use Cases – Privacy Regulations

So let’s look at a few common use cases. First, you’re gonna have situations where you have to have special controls and restrictions around VIP or celebrity customers. When Beyonce checks into a hospital, everyone working at the hospital wants to know what she’s in for. When I check into a hospital, the only people that really care are the doctor treating me and the X-ray tech that’s gotta do the X-ray. The same thing happens with casinos and high rollers, or banks with high-net-worth individuals, especially if it’s an investment bank operation. So every organization is typically gonna have those records that require special treatment or special consideration. Another one is cross-border data flows, where different privacy regulations have restrictions on where data’s stored physically, which is becoming more and more of a challenge, especially as organizations move to the Cloud: nowadays you have less and less control over where the data physically is, or it’s just more abstracted from what’s going on. Most organizations really need to focus more on the logical location of the data, who gets to see what data under what circumstances, depending on where they’re accessing the data from, as opposed to where the data is physically stored. And when you move to the Cloud, you might not even have any control over where the data is physically located. That might be completely within the span of control of the Cloud provider themselves, unless you have special requirements in your hosting contract with them. And then there’s Consent and Preference Management, as we mentioned earlier; this is a big part of providing any privacy compliance. It’s making sure that you’re only using the data for what you’ve gotten permission to use it for, you’re only keeping it for as long as you should, and you’re not sharing it with anyone that you haven’t specifically stated and gotten permission to share it with, et cetera.
In the US especially, when organizations have suffered financial penalties for privacy violations, it’s usually been a contractual violation of their own privacy policies, where they clearly stated, “We will not share your data with anyone under any circumstances,” and then they went and did it anyway, and it was found out. In almost all cases in the US, it tends to be that somebody just didn’t do what they said they were gonna do. So another very important thing to keep in mind is: say what you’re gonna do, and only do what you said, when it comes to your company’s privacy policy. And then the last one is Real-Time Behavior and Analytics Monitoring. This becomes very important for being able to establish what is normal access or use of the data. In any organization, it’s gonna be hard right upfront to know exactly how a data scientist or data analyst is gonna have to use that data. So being able to collect the information that lets you see who’s accessing what data under what circumstances, what applications they’re using, what columns they need to see, how many records they typically access, and so on, lets you establish that baseline or profile for a particular user or peer group, and puts you in a much better position to detect anomalous or suspicious activity for that group or user.

Balancing Two Opposing Forces

So now we come to the balance between these two opposing forces. On the one end, you’ve got the data protection and privacy compliance rules: Personally Identifiable Information or Protected Health Information has to be protected, and access strictly controlled on a need-to-know basis. There are quite a few laws like CCPA that are now enforcing this. You’ve also got to manage all that consent and preference management that we talked about. Well, all of this kind of flies in the face of the advanced analytics and monetization of the data, where you really want to have completely unlimited access to all the data at all times, using any application, for any of your data analysts or data scientists. You wanna have complete flexibility on the mobility of the data and where you host it; if AWS is offering a better deal than Azure one month, you might wanna move it over. You’ve got elastic consumption of storage and compute capabilities that you spin up and spin down. And it’s very important that you maintain that freedom to leverage any and all analytic tools, but at the same time still be able to apply consistent privacy compliance.

Essential Data Protection & Compliance Capabilities

So when you do this, there are a few specific things that you need to make sure that you do. First, the Data Loss Prevention sort of category: being able to prevent abuse, being able to detect when a trusted insider is doing something malicious or unauthorized, or abusing the privilege that you’ve granted them, or worse, when their credentials somehow have been compromised and you’ve got an external hacker or other rogue employee that’s gotten hold of someone else’s credentials and is accessing data through an account that would otherwise appear to be authorized to see that data in the clear. Being able to provide very fine-grained access control, and being able to lock down and establish a profile for how one user uses the data versus another, puts you in a much better position to detect suspicious or anomalous activity anytime you see it. Then there’s being able to control cross-border data flows, or the geo-fencing of access to the data. And on the Governance side, of course, you’re gonna have very specific or prescriptive rules that have to be enforced within the different regulations, like supporting someone’s Right of Erasure. That’s a really clear, definitive piece of functionality that has to be a requirement built into any data set, data repository, or application that is accessing personally identifiable information.

And then on the Encryption side, most privacy regulations don’t mandate or require particular data to be encrypted. But encryption, when used appropriately, is a very powerful tool to prevent unauthorized access to, or exposure of, any sensitive data. Particularly sensitive fields, like social security numbers, credit card numbers, and a few others such as date of birth, are quite often selected for an additional layer of security, where you apply Column-Level Encryption, Field-Level Encryption, or tokenization to those particular fields. And in the case of outsourcing to the Cloud, you can use encryption to anonymize records before they’re copied to the Cloud. If I’ve encrypted the first name, last name, date of birth, email address, phone number, and social security number before hosting at a Cloud provider, I don’t need to worry about that Cloud provider ever being able to see any of that data in the clear, and neither I nor my customers will suffer any damages if there’s some inadvertent data breach at that Cloud provider.
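The field-level protection described above can be sketched in a few lines. This is a hedged illustration, not SecuPi's implementation: a keyed HMAC stands in for deterministic field-level encryption or tokenization, and all key material, field names, and record values are made up for the example. The key stays on-premise (the Hold Your Own Key model); only the protected values ever leave the organization.

```python
import hmac
import hashlib

# Illustrative only: a real deployment would hold this key in an
# on-premise KMS or HSM, never in source code.
KEY = b"on-prem-master-key"

def protect(value: str, key: bytes = KEY) -> str:
    """Deterministically tokenize one sensitive field value.

    HMAC is one-way, so this models pseudonymization; real field-level
    encryption (e.g. AES-SIV or format-preserving encryption) would also
    be reversible for authorized users at consumption time.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"first_name": "Jane", "ssn": "123-45-6789", "state": "CA"}
SENSITIVE = {"first_name", "ssn"}

# Tokenize only the sensitive columns before the record leaves the org;
# non-sensitive columns stay in the clear for analytics.
cloud_copy = {k: protect(v) if k in SENSITIVE else v
              for k, v in record.items()}

# Determinism is the key property: the same SSN always yields the same
# token, so equality joins and exact-match searches still work later.
assert protect("123-45-6789") == cloud_copy["ssn"]
```

Because the transformation is deterministic per key, the protected copy keeps its analytic structure, which is what makes the joins and searches shown later in the demo possible.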

BYOK (Bring Your Own Key) versus HYOK (Hold Your Own Key)

So let’s look at the two different ways this can be done. On the left, we have the normal Cloud service provider encryption. This can be applied at a disc level or a file level, they can encrypt an entire database, for example, or they can apply it at a column level. So you have situations where the Cloud provider gives you a solution for doing that, and it’ll be unique or different for each Cloud provider. So if you’re using a hybrid Cloud environment, or a mixture of on-prem and in the Cloud, it’s gonna be challenging if you have separate encryption solutions on each platform. A lot of Cloud providers’ response to this was, “You bring your own key. You give us the keys you want us to use to encrypt the data.” So you control the keys. But ultimately you’re not controlling the keys, ’cause you’re really just handing the keys over to somebody else and trusting them, in the same way, to only use them the way they’re supposed to be used. So any of these Bring Your Own Key solutions, or Cloud-provided solutions, have limited applicability when it comes to very sensitive or regulated data. And for most organizations, they simply don’t fit the trust model that their internal compliance or information security group is trying to maintain. Now, on the other hand, with the Hold Your Own Key concept, you can still check the box that the data is encrypted at rest, but you can have the exact same keys used to protect the data whether it’s on-prem or in the Cloud, and the keys remain on-premise. You encrypt those particularly sensitive fields, enough to anonymize the records, before they ever leave your organization. And they’re only decrypted on the fly at runtime, for authorized users, on consumption. So now there is no data breach that could happen at a Cloud provider or at the data layer that can expose any of that sensitive data in the clear. So it’s a much better approach.
It’s a much better trust model for a lot of organizations, and it’s quickly becoming the standard for Cloud implementations; even organizations like Gartner are now recognizing this and making recommendations to that effect.

So when you look at this, then, you start right from the data flow, with end-to-end protection of the data.

Data-Centric Security & Privacy Example for all AWS Data Services: Apply Column-level Encryption (at rest & in transit), Fine-Grained Access Control to ALL sensitive data with Dynamic Masking, Anonymization, Accountability, Audit Trail and UBA across ALL AWS Workloads

It’s one single pane of glass, an access control and protection solution, where you have that column-level encryption applied right at the ingestion point. So whatever ETL tool you might be using to load the data into your Cloud or on-premise data repository, you wanna apply the encryption before the data is loaded into the data repository. Then it remains in its encrypted form anytime it’s hosted in the Cloud on the compute or data layers, regardless of your platform, whether it’s Spark or anything else. And then you decrypt at runtime based on various user or data attributes: location, the application someone’s connecting from, their job title, what role memberships they have, so that you only decrypt whatever columns are needed, on a need-to-know basis, for that particular user at runtime. And having a centralized solution that’s managed from one point is much more cost-effective, and applies the rules more consistently, than having individual proprietary solutions that are unique to each platform.
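The attribute-based, decrypt-at-runtime decision described above can be sketched as a small policy function. This is a hypothetical model for illustration only; the policy shape, role names, and column names are invented and do not reflect SecuPi's actual policy engine.

```python
# Hypothetical policy: which roles, from which locations, may see each
# sensitive column in the clear at consumption time.
POLICY = {
    "ssn":        {"roles": {"fraud_analyst"},            "locations": {"US"}},
    "first_name": {"roles": {"fraud_analyst", "support"}, "locations": {"US", "EU"}},
}

def may_decrypt(column: str, user: dict) -> bool:
    """True if this user, in this context, sees the column in the clear."""
    rule = POLICY.get(column)
    if rule is None:  # column not classified as sensitive: always visible
        return True
    return (user["role"] in rule["roles"]
            and user["location"] in rule["locations"])

def render(row: dict, user: dict) -> dict:
    # Decrypt (modeled here as simply revealing) only the columns this
    # user is entitled to; everything else stays in its protected form.
    return {col: val if may_decrypt(col, user) else "<protected>"
            for col, val in row.items()}

row = {"ssn": "123-45-6789", "first_name": "Jane", "state": "CA"}
analyst = {"role": "fraud_analyst", "location": "US"}
intern  = {"role": "intern",        "location": "US"}

print(render(row, analyst))  # all columns in the clear
print(render(row, intern))   # ssn and first_name stay protected
```

The same row yields different views for different users, which is the essence of enforcing need-to-know at consumption rather than at storage.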

So, a common situation that we have at one of our customers, one of the Blue Cross Blue Shield organizations: there they were using primarily Talend, and a little bit of NiFi, for loading data into the Cloud.

BCBS (Healthcare) – Case Study

They were using Azure. They would selectively encrypt individual columns before loading them. So even though they might only be encrypting seven or eight columns within the customer transaction tables and so on that are loaded into the Cloud, they were still controlling access at a fine-grained level to hundreds and hundreds of columns across all the different tables, independent of the ones that had that extra layer of security, that encryption at rest. So in the Azure Cloud, of course, you’ve got only anonymized records. And then regardless of whatever tool someone’s using to access that data, whether it’s Spark or anything else on the consumption side, you’re unprotecting the data at runtime for the authorized users. And this satisfied all of Blue Cross Blue Shield’s requirements. They wanted the patient ID and several other particularly sensitive fields, like social security number, to be encrypted at rest. They wanted to have a complete data discovery and data flow mapping capability that would allow them to gain insight into how the data is being used, by whom, and under what circumstances. That helped them to fine-tune the fine-grained access controls that they were putting in place. And they were insistent that it had to be a Hold Your Own Key concept; they weren’t allowed to go to the Cloud unless they could show that these records were gonna be adequately anonymized, or at least pseudonymized, before being loaded to the Cloud. And they needed to support a wide range of different platforms. They were obviously a heavy Spark user, but also had other platforms that they had to use at the same time.

So what we’re gonna do now is take a few minutes to have Alon run a demo on Spark, letting you see how, even though you’ve got columns that are encrypted at rest within a data repository, you can still access the data. You can still perform joins on those protected fields, you can still do searches, et cetera. Pretty much all of your analytics can be done in a way that’s transparent to the end user. And we’ll show you a few different examples on Spark and Kafka. And then we’ll be back to continue the session. – [Alon] Welcome to this demonstration of how SecuPi can protect Spark and similar systems. Here is a Jupyter notebook, which I will use to interact with our Spark system. This particular Spark system includes a special ingredient, namely SecuPi, which is currently embedded into the Spark application. It is now watching everything going on in Spark, and it will react if and when necessary. Spark is getting customer data from our BigQuery system.

This is how that data is stored in BigQuery. We also have this detailed table associated with the customers table, which we usually join with the customers table on the SSN column. The join looks like this.

Now back to Jupyter. I will now initialize my Spark session by running the first cell. I’m connecting to Spark using a user called yarn, which in SecuPi, we set to be an analyst that must work with encrypted data. Now I will initialize my data frames that will read from the two tables in BigQuery. And if I run this command that will show the data, the data is presented to me like this.

The sensitive fields, namely these four, have been encrypted in my view. But despite that, I can still run a join with the other table, because that other table also has an SSN field, which is encrypted in exactly the same way as the SSN field here. So when I run the cell, I get this.
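The reason the join Alon runs still works is worth making explicit: both tables' SSN columns were protected with the same deterministic transformation, so equal plaintexts map to equal ciphertexts and an equality join needs no decryption at all. A minimal sketch, using a keyed HMAC as a stand-in for the real encryption and invented table contents:

```python
import hmac
import hashlib

# Same key applied to the SSN column of both tables at ingestion time
# (illustrative key and data, not from the demo environment).
KEY = b"shared-column-key"

def enc(value: str) -> str:
    """Deterministic stand-in for the column encryption."""
    return hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Two "tables" as they sit in storage: SSN already encrypted in both.
customers = [{"ssn": enc("123-45-6789"), "name_enc": enc("Jane")}]
details   = [{"ssn": enc("123-45-6789"), "balance": 250}]

# Equality join on the encrypted column; no key needed at join time.
joined = [{**c, **d}
          for c in customers
          for d in details
          if c["ssn"] == d["ssn"]]

assert len(joined) == 1 and joined[0]["balance"] == 250
```

In Spark the same property holds: `df_customers.join(df_details, "ssn")` matches rows by ciphertext equality, so analysts restricted to encrypted data can still correlate tables.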

Switching gears a bit, let’s now consider a scenario where I upload my data into BigQuery already in encrypted form.

I have done that and uploaded the data into this customers enc table, and it looks like this. I’ve done the same for the customers details enc table. Then, in SecuPi, I’m changing the role of yarn so that I get to see data in decrypted form.

Now back in Jupyter, I run these cells, and this is what I see.

The encrypted sensitive fields are now shown to me in clear text, as if I were working with a table that was not encrypted at all. I can also run joins like this one.

Again, I am able to receive data, because the SSN fields in both of these tables were automatically decrypted for me in a consistent manner, which allows the joins to work.

In our setup, we embedded SecuPi in the Kafka producer, so that when we stream the data into the Kafka cluster, a rule that we implemented encrypts one of the fields here, namely the first name. So when the data landed in the cluster, it looked like this, with the first name fields encrypted. Now, on the other side, we will query the data here using KSQL.

Now, we also embedded SecuPi in KSQL, so that it implements the rules that we defined in SecuPi. On the left-hand side, I will log in as user adam.

And on the right, I will log in as user tom. In our SecuPi rule, we specify that adam is authorized, meaning he can see data in clear text, while tom is not authorized. Now I’m going to run these queries on both sides.

And as you can see, they don’t see the same thing. Tom continues to see the data in encrypted form while adam is now able to see the data in cleartext. And we can go beyond that. I can run as adam a query like this, where I’m using a clear-text value in a where condition in the query.

And I get a result, even though in the Kafka cluster we don’t have a row where the first name is equal to this value. So no matter how complicated the query is that adam is executing, he can rest assured that the table he’s actually querying will behave as if it contains clear-text values, even though, in reality, it doesn’t. – So now that you’ve seen a few quick examples of how this can work in Spark, we’d like to go into a quick poll question for the audience.
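The trick behind that last query, searching encrypted data with a clear-text predicate, can be sketched simply: the interception layer rewrites the predicate, encrypting the literal with the same deterministic key that was used on ingestion, then compares ciphertext to ciphertext. This is an illustrative model, not SecuPi's mechanism; names, key, and rows are invented.

```python
import hmac
import hashlib

KEY = b"field-key"  # illustrative; the same key used on ingestion

def enc(value: str) -> str:
    """Deterministic stand-in for the field encryption."""
    return hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Rows as they actually sit in the Kafka topic: first_name is encrypted.
rows = [{"first_name": enc("Alice"), "city": "Austin"},
        {"first_name": enc("Bob"),   "city": "Boston"}]

def query(first_name_literal: str):
    # Predicate rewrite: compare against enc(literal), not the literal,
    # so the stored ciphertexts can match a clear-text WHERE clause.
    target = enc(first_name_literal)
    return [r for r in rows if r["first_name"] == target]

# An authorized user queries with a clear-text value and still gets a
# hit, even though no stored row contains the literal "Alice".
assert query("Alice")[0]["city"] == "Austin"
assert query("Carol") == []
```

Note this only works for equality predicates; range or LIKE conditions would require the data to be decrypted, or a different encryption scheme, which is one reason deterministic encryption is usually reserved for join and lookup keys.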

And if you could please answer just this one question: how important is it for your organization to apply column-level encryption, or Hold Your Own Key, prior to data being hosted in the Cloud? You may find that this is essential to your organization and it’s critical, like Blue Cross Blue Shield did, for example. Maybe it’s just a nice-to-have, but you don’t think it’s a requirement. Or it’s not required, perhaps because your organization doesn’t host any sensitive or regulated data in the Cloud, or isn’t doing anything in the Cloud at all, and it’s not a question you even have to think about. Or are the Cloud hosting providers’ file-level encryption and key management features considered adequate by your information security and privacy people, fully trusting the outsourcing of that to your Cloud provider? So if you could just pick whichever one of these you think is the most appropriate for your organization, that would be most appreciated.

Thanks, everyone, for your responses. Most appreciated. We’ll certainly make the results that we got back available, if they’re not available before the end of the session.

So this is clearly something that a pretty wide range of customers we see in the marketplace really care about. Being able to provide this privacy compliance and very granular data protection is a very high priority for any organization processing sensitive or regulated data, regardless of what country you’re operating in. And data protection is a very complex global issue. It’s not as simple as most organizations initially think it might be. Your business may be global, but regulatory compliance is still local.

And we’d like to stop now and spend the rest of the time on open Q&A. We’ve got several people online, Alon and myself and a few others, who will be able to respond to any questions you have in the chat window. And we’ll try and answer as many of them as we can in the time that we have left in the session.

And then just while we’re providing answers to those questions, if we could have everyone also complete your feedback

using the normal channels provided by the event. We look forward to seeing you, hopefully in person next year, rather than just over video conference. Thanks for taking the time, and we’ll stay online to answer questions till we’re out of time.

About Alon Rosenthal

SecuPi Security Solutions Ltd.

Alon is a serial entrepreneur and the co-founder & CEO of SecuPi. He invented and built the first dynamic masking platform in his first successful company and has twice won the Gartner Cool Vendor award. For the last 15 years he has been building platforms for protecting data privacy across business applications in the largest corporations worldwide.

About Les McMonagle

SecuPi Security Solutions Ltd.

Les has over twenty years’ experience in information security consulting and advisory services. He has held the position of Chief Information Security Officer (CISO) for a credit card company and an ILC bank, founded a computer training and IT outsourcing company in Europe, directed the security and network technology practice for Cambridge Technology Partners across Europe, and helped several security technology firms develop their initial product strategy. Les founded and managed Teradata’s Information Security, Data Privacy and Regulatory Compliance Center of Excellence, was Chief Security Strategist at Protegrity, was Vice President of Security Strategy at BlueTalon, and is now Chief Security Strategist at SecuPi. He is a frequent speaker on data-centric protection and privacy at various conferences and trade shows.