Data Security at Scale through Spark and Parquet Encryption

May 26, 2021 12:05 PM (PT)

Big data presents new challenges for protecting the privacy and integrity of sensitive information. Straightforward application of traditional file encryption and MAC techniques can’t cope with the staggering volumes of data flowing in modern analytic pipelines.

 

Apple addresses these challenges by leveraging the new capabilities in the Apache Parquet format. We work with the Apache Parquet community on a modular data security mechanism that provides privacy and integrity guarantees for sensitive information at scale; the encryption specification has been approved and released by the Apache Parquet Format project. Today, there are two open source implementations of this specification – in the Apache Arrow (C++) and Apache Parquet-MR (Java) repositories. The latter has just been released in the parquet-mr-1.12 version – which means that Apache Spark and other Java/Scala-based analytic frameworks can start working with Apache Parquet encryption.

 

In this talk, Gidon Gershinsky and Tim Perelmutov will outline the challenges of protecting the privacy of data at scale and describe the Apache Parquet encryption technology security approach. We will give a quick intro to usage of Apache Parquet encryption API in pure Java and in Apache Spark applications. We will also discuss the roadmap of the community work on new encryption features and on deeper integration with Apache Spark and other analytic frameworks. Finally, we will show a demo of the Apache Parquet modular encryption in action, sharing our learnings using it at scale.

In this session watch:
Gidon Gershinsky, Lead Systems Architect, Apple
Tim Perelmutov, Data Engineer, Apple

 

Transcript

Gidon Gershinsky: Okay. So we are going to talk about data security at scale through Spark and Parquet encryption. I am Gidon Gershinsky. I design and build data security solutions at Apple, and I am active in the Apache Parquet community work on data encryption. And we have Tim Perelmutov, who works on data ingestion and analytics for iCloud. The agenda for today: we’ll talk about the Parquet encryption goals and features, and its status in the Apache projects. We will learn how to use the API and how to write simple “Hello World” applications. We’ll discuss the community roadmap, and we’ll finish with a demo and learnings from using Parquet encryption at scale. So as you know, Apache Parquet is a popular columnar storage format. It has a number of useful features, such as encoding for data compression and advanced data filtering: columnar projection, where you skip columns that are not required for your queries, and predicate push down, where you skip files or parts of files which are not required [inaudible].
So the performance benefits of Parquet filtering are quite obvious. You have less data to fetch from the storage, so you save on I/O and time, and you have less data to process, so you save CPU and latency. The question we asked was, “How do we protect the sensitive Parquet data?” And we have designed and developed a technology called Parquet Modular Encryption, which fulfills a number of goals. First and foremost is the protection of sensitive data. Here we mean two things. One is the protection of data privacy, where we hide sensitive information. The other goal is the protection of data integrity, making the data tamper-proof against attacks.
Another goal was to preserve the performance of analytic engines when running with encryption. So all these wonderful Parquet capabilities, columnar projection and predicate push down, [inaudible] the data, work just as well. And here we had a big data challenge with integrity protection: if we sign the full file, we break the filtering, which slows down the workloads. An additional goal we set was to define this as an open standard for safe storage of analytic data, which works exactly the same in any storage, cloud or private: file systems, object stores, archives, and so on. Obviously, they don’t have to be trusted. It should work with any key management service in your platform, and it provides the same key-based access in any storage. Also, you can encrypt the data in a safe private environment, ship it to public cloud storage, and once you drop it to archives, it will stay protected. And we also support different keys for different columns in a dataset.
Okay. So here we also address a number of challenges, like safe migration of data from one storage to another. You don’t need to export, decrypt, re-encrypt and import your data anymore; you just move the files from one storage to the other. You can also share data subsets or columns in a table. Again, no need to extract a copy for each user; you just provide the keys to the eligible users. Okay. So we have two sets of capabilities: one is privacy protection, the other is integrity. For privacy, we have a number of modes. Full encryption mode means that we encrypt everything: every module of data and metadata in the Parquet files is encrypted. But we also have a more relaxed plaintext footer mode, where the footer of a Parquet file is exposed, so legacy readers can see the plaintext columns of encrypted files, if you need that. Sensitive metadata is still hidden in this mode as well. Like I mentioned, you can use different keys for different columns; this is how you get access control for the columns. And it is client-side encryption from the storage point of view, so the storage backends never see the data or the keys.
Okay. These are the capabilities for data integrity, making sure that the file contents are not tampered with. There are many ways to attack the integrity of files. A few bytes can be replaced, which is easy to detect. But a more sophisticated attacker would replace pages in your Parquet files, or column chunks. So we have protection against this attack as well, and you don’t have to do anything: if we detect tampering with the contents of a file, we throw an exception. Another type of attack is a [inaudible] attack. The attacker keeps an old version of your data and replaces your current data with that old version, which is tampering, even if the files themselves are not modified. So here, we assign an ID to each file, and there is protection against this kind of attack too. We use the AES GCM cipher, which is authenticated encryption, for that. And we have a framework for other algorithms too.
Okay. We use the standard practice of envelope encryption. Basically, the file modules are encrypted with a so-called Data Encryption Key, a DEK. And the DEKs are encrypted with Master Encryption Keys (MEKs). The result is called key material, and it is stored in the Parquet file footer or in a separate file in the same folder. We’ll see why in a second. The master keys are managed in your key management service (KMS), with access control and so on. We also have an advanced mode in Parquet encryption, double envelope encryption, where data keys are encrypted with key encryption keys, KEKs, and those KEKs are encrypted with the master keys. This is useful for optimizing the interaction with the KMS. KMS servers can be slow; they often are. So here you can have a single KMS call in the lifetime of the process, or one call in X minutes. It’s configurable.
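To make the envelope idea concrete, here is a minimal conceptual sketch of DEK/MEK wrapping with AES GCM. It is not the Parquet implementation, only an illustration of the scheme described above; the class name EnvelopeSketch and the sample data are made up, and in a real deployment the master key would live in the KMS rather than in the process.

```scala
import java.nio.charset.StandardCharsets
import java.security.SecureRandom
import javax.crypto.{Cipher, KeyGenerator, SecretKey}
import javax.crypto.spec.GCMParameterSpec

// Conceptual sketch of envelope encryption (hypothetical, not Parquet's internals):
// 1. generate a random Data Encryption Key (DEK) per file module,
// 2. encrypt the module with the DEK,
// 3. wrap (encrypt) the DEK with a Master Encryption Key (MEK); the wrapped DEK is
//    the "key material" that goes into the footer or a separate small file.
object EnvelopeSketch {
  private val random = new SecureRandom()

  def aesGcmEncrypt(key: SecretKey, plaintext: Array[Byte]): Array[Byte] = {
    val iv = new Array[Byte](12)                       // 96-bit nonce, standard for GCM
    random.nextBytes(iv)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv))
    iv ++ cipher.doFinal(plaintext)                    // prepend IV so a reader can decrypt
  }

  def main(args: Array[String]): Unit = {
    val keyGen = KeyGenerator.getInstance("AES")
    keyGen.init(128)
    val mek = keyGen.generateKey()                     // master key; in reality held in the KMS
    val dek = keyGen.generateKey()                     // per-module data key

    val module          = "sensitive column chunk".getBytes(StandardCharsets.UTF_8)
    val encryptedModule = aesGcmEncrypt(dek, module)           // data encrypted with the DEK
    val keyMaterial     = aesGcmEncrypt(mek, dek.getEncoded)   // DEK wrapped with the MEK

    println(s"module: ${encryptedModule.length} bytes, key material: ${keyMaterial.length} bytes")
  }
}
```

In the double-wrapping mode described above, an intermediate KEK sits between the DEK and the MEK, which is what allows the library to call the KMS once per process (or once per configured interval) instead of once per key.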
Okay. The current status. First and foremost, it’s open-source community work, so thanks a lot to the many contributors to this technology. We have the Parquet community with the parquet-format repository, where the specification of this technology was approved and released a couple of years back. We have the Java implementation of this technology in the parquet-mr project, which was released just recently in version 1.12. We have the Apache Arrow community and the C++ implementation there, which was also merged this year, and there is ongoing work on the Python interface for it. And of course we have the Apache Spark community, which has updated the Parquet version to 1.12 in the master branch, which enables basic encryption out of the box. This is planned for the Spark 3.2.0 release. And there is ongoing work in other analytic frameworks on integrating this technology.
Okay. So how does Spark work with Parquet encryption? By passing standard Hadoop parameters. So you pass a list of columns to encrypt, you specify the IDs of the master keys for these columns, you also specify the ID of the master key for the footer, and you give the name of the class for your KMS client. And you activate encryption also using a Hadoop parameter. We’ll see [inaudible 00:09:29]. More detailed instructions you can find in this Parquet ticket, and you can even try it now: just clone the Spark repository and build a runnable distribution. That should be enough. Okay, so how to write encrypted Parquet files? You run a spark-shell. You arm, or activate, encryption using this particular parameter that I just [inaudible 00:09:58]. And you pass the master keys; in this particular demo it’s not a real KMS, we use a mock KMS, so you have to give the master keys explicitly. We have two keys, k1 and k2. Now you’re ready to write your files with the standard Spark API. So you encrypt column A with the key named k2, and you protect the footer with k1. And you’re done. You can encrypt different columns with different keys, as I mentioned; here’s the full format of this parameter.
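As a hedged illustration of those steps, here is a spark-shell sketch of writing an encrypted dataframe with the mock in-memory KMS. The property names and the InMemoryKMS class are taken from the parquet-mr 1.12 key tools and Spark columnar-encryption documentation as I understand them; the key values, column name and output path are made up for the example.

```scala
// In spark-shell, with a Spark build that bundles parquet-mr 1.12.
// "Arm" encryption by pointing Parquet at the properties-driven crypto factory.
sc.hadoopConfiguration.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")

// Demo-only mock KMS: master keys k1 and k2 are passed explicitly (base64, 128-bit).
sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
sc.hadoopConfiguration.set("parquet.encryption.key.list",
  "k1:AAECAwQFBgcICQoLDA0ODw==, k2:AAECAAECAAECAAECAAECAA==")

// Standard Spark write API; only the two encryption options are new.
val df = spark.range(10).toDF("a")
df.write
  .option("parquet.encryption.column.keys", "k2:a")          // encrypt column "a" with master key k2
  .option("parquet.encryption.footer.key", "k1")             // protect the footer with k1
  // .option("parquet.encryption.plaintext.footer", "true")  // optional: plaintext footer mode
  .parquet("/tmp/encrypted_table")
```

The full format of the column-keys parameter groups columns per key, e.g. "k2:colA,colB;k3:colC". In a real deployment the key list would never appear in the configuration; the master keys stay in your KMS and only their IDs are referenced.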
Okay. To read encrypted files, you have to do even less, because most of the metadata is already in the Parquet files. So you run a spark-shell. Again, you arm or activate the encryption (here, decryption) using exactly the same parameter; you just copy-paste it. And you pass the master keys to this demo mock KMS client. And [inaudible] you read using the standard spark.read.parquet function. Okay, in the real world, you would keep your master keys in a KMS system. So you need to develop a client for your KMS server, and this client implements the KmsClient interface in the Parquet library. It has basically two methods. One is wrap key, and the opposite is unwrap key. Wrap key means we give you the data key bytes, you go to your KMS, wrap the data key with the master key identified by its ID, and give us back the result; we store it. And when you read the data, with the unwrap key method we give you the key material, you go to your KMS, you unwrap the key material, and you give us the data key back. That’s it.
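A rough sketch of such a client follows. The wrapKey/unwrapKey methods mirror the description above, but the exact signatures of org.apache.parquet.crypto.keytools.KmsClient should be checked against the parquet-mr 1.12 sources; MyKmsRestApi and MyKmsClient are hypothetical stand-ins for your own KMS integration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.keytools.KmsClient

// Hypothetical stand-in for your KMS server's API; replace with real calls.
object MyKmsRestApi {
  def encrypt(masterKeyId: String, dataKey: Array[Byte]): String = ???   // "wrap" endpoint
  def decrypt(masterKeyId: String, keyMaterial: String): Array[Byte] = ??? // "unwrap" endpoint
}

// Sketch of a custom KMS client; method signatures assumed from the parquet-mr interface.
class MyKmsClient extends KmsClient {

  override def initialize(conf: Configuration, kmsInstanceID: String,
                          kmsInstanceURL: String, accessToken: String): Unit = {
    // Authenticate / connect to the KMS instance here.
  }

  // wrapKey: Parquet hands us the plaintext data-key bytes; we ask the KMS to encrypt
  // them with the master key identified by masterKeyIdentifier and return the result.
  override def wrapKey(keyBytes: Array[Byte], masterKeyIdentifier: String): String =
    MyKmsRestApi.encrypt(masterKeyIdentifier, keyBytes)

  // unwrapKey: Parquet hands us the stored key material; we ask the KMS to decrypt it
  // with the same master key and return the plaintext data-key bytes.
  override def unwrapKey(wrappedKey: String, masterKeyIdentifier: String): Array[Byte] =
    MyKmsRestApi.decrypt(masterKeyIdentifier, wrappedKey)
}
```

The client class is then registered through the parquet.encryption.kms.client.class property instead of the mock KMS used in the demo.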
Here’s an example of such a KMS client for the open source HashiCorp Vault. You can find this class in the Parquet GitHub. For Vault, you specify the KMS client class, you give a token, which holds your authorization to work with the keys, and you point to the Vault server [inaudible] for this KMS. Okay. And now, again, you’re ready to write your data frame using exactly the same API: you use k1 for the footer and k2 for column A, and you can read the data back. Okay. We have a couple of advanced key management features, like minimization of KMS calls. We already talked about double envelope encryption. It is activated by default; you can disable it if you want to. And again, it gives you the option to call the KMS once in a process lifetime, or once in X minutes, per master key.
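A hedged sketch of what that configuration can look like in spark-shell. The Vault client class path, server URL and token below are placeholders, and the property names (access token, KMS instance URL, double wrapping, cache lifetime) are assumed from the parquet-mr key tools documentation.

```scala
// Point Parquet at a Vault-backed KMS client instead of the mock one
// (class path, URL and token are placeholders; check the class shipped in parquet-mr).
sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.samples.VaultClient")
sc.hadoopConfiguration.set("parquet.encryption.key.access.token", "<vault-token>")
sc.hadoopConfiguration.set("parquet.encryption.kms.instance.url", "https://vault.example.com:8200")

// Advanced key management: double (envelope) wrapping is on by default,
// and KMS call results are cached for a configurable lifetime.
sc.hadoopConfiguration.set("parquet.encryption.double.wrapping", "true")
sc.hadoopConfiguration.set("parquet.encryption.cache.lifetime.seconds", "600")

// Writing and reading then use exactly the same API as before.
val df = spark.range(10).toDF("a")
df.write
  .option("parquet.encryption.footer.key", "k1")
  .option("parquet.encryption.column.keys", "k2:a")
  .parquet("/tmp/encrypted_table")
val readDf = spark.read.parquet("/tmp/encrypted_table")
```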
Okay. Key rotation is a standard part of envelope encryption, so you have to refresh master keys periodically or on demand. There are many reasons for doing that. To enable key rotation, you have to specify this parameter when writing the data. It will store the key material externally, in small files next to the Parquet files. Parquet files are immutable; it’s big data, you don’t want to touch them. So when you perform a key rotation, you change or modify only those small key material files. First, you rotate the master keys in your key management system, and then you use the Parquet key toolkit [inaudible] and call its rotate-master-keys method, pointing it at the folder with those Parquet and key material files. Okay. You can work with plain Java just as well. Basically, you create the same Hadoop configuration object, you fill it up with the same set of parameters, and you write and read your data using the standard Parquet API. Okay.
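A hedged sketch of that flow. The external-key-material property name and the KeyToolkit.rotateMasterKeys call are assumed from the parquet-mr 1.12 key tools (please verify the exact names against the parquet-mr sources), and the table path is hypothetical.

```scala
import org.apache.parquet.crypto.keytools.KeyToolkit

// 1. Write with external key material, so rotation only touches the small key files.
sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally", "false")
val df = spark.range(10).toDF("a")
df.write
  .option("parquet.encryption.footer.key", "k1")
  .option("parquet.encryption.column.keys", "k2:a")
  .parquet("/tmp/encrypted_table")

// 2. Rotate the master keys k1 and k2 in the KMS itself (new key versions created there).

// 3. Re-wrap the key material files under the new master key versions;
//    the immutable Parquet files themselves are not rewritten.
KeyToolkit.rotateMasterKeys("/tmp/encrypted_table", sc.hadoopConfiguration)
```

In the plain Java path, the same Hadoop Configuration object with the same properties is handed to the standard Parquet writers and readers, so nothing Spark-specific is required.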
So now to the subject of performance: how does encryption affect the speed of your workloads? The main thing to remember is that AES ciphers are implemented in the standard CPU hardware of today; we use the so-called AES-NI instructions. That gives you on the order of gigabytes per second per thread. It’s extremely fast, much faster than anything the software stack above it will do, so the application and the framework, say Parquet compression, will probably be slower than the encryption. With C++, AES is handled by the standard OpenSSL EVP library. And with Java, you don’t have to do anything, or not much, because the HotSpot engine, since Java 9, automatically sends AES cipher operations to the CPU hardware. There were some problems in Java 9 and Java 10; we talked with the Java folks, and they fixed the issue in Java 11.0.4, so now AES GCM works fine. So thanks a lot to the Java folks for that. The bottom line: encryption probably won’t be your bottleneck. There is the application workload, there is data I/O, there is encoding, compression, and so on.
Okay, the community roadmap. We have a number of things. Right now, there is Apache Spark work on a so-called two-tier encryption key management, which further optimizes the KMS interaction. We have a number of new features in the Parquet pipeline: for future Parquet versions we are working on things like uniform encryption, where you can wrap all the columns in your table with the same key in a more efficient manner, a CLI for encrypted data, and other things. The Apache Iceberg, Presto, and Hudi communities are working on integrating Parquet encryption in their frameworks. And the Apache Arrow community is working on the Python API for Parquet encryption.
And now I would like to hand over to Tim.

Tim Perelmutov: Hi, now I’m going to illustrate how Parquet modular encryption is used at Apple in one of our projects. iCloud CloudKit Analytics is a system for collection, storage, and analytics of a sample of iCloud services’ metadata databases. [inaudible] perform a sample, and we maintain a number of user cohorts. For example, we have an iCloud-wide sample of all iCloud users that contains about 0.1% of all the users; we have semantic and geographic cohorts; and we can create ad hoc cohorts as needed. iCloud CloudKit Analytics performs weekly ingestions of the records from iCloud services’ metadata databases, as well as continuous ingestion of certain types of user activity data, again sampled based on users’ membership in the cohorts. After we ingest the data, we remove any identifying information from the records and organize the data on disk in hundreds of partitioned Hive tables. The data is then registered with Hive and becomes available for analytics through notebook services. Analysts can run Spark queries, and we also have a number of pre-configured batch workflows which produce data on a weekly cadence. The results are used to power weekly reports, as well as in other iCloud services.
Next. So how do people use our data? One of the major power users of our data is the iCloud storage system. They use advanced machine learning to determine which classes of storage are assigned to specific types of user data. They use our system for forecasting future storage needs, as well as for determining the current state of the system; for example, they can tell how much of the data is eligible for deletion or compaction at any given time. Other services use us for storage utilization and spike analysis, and for anomaly detection. During the release of new functionality, service owners can use us to see the impact of the changes on their system, as well as for data integrity verification. Also, through ad hoc analysis, people get access to data in our system that also exists in other systems, like Splunk, for example, but our system provides much better performance.
So because of the nature of our data, we needed a way to protect the system from unauthorized access, and we had a number of requirements that are outlined here on this slide. I’m not going to go through them in detail, but when we were looking for an encryption solution, Parquet modular encryption was a natural choice for us as soon as it became available at Apple. So we performed the switch very recently, and here I’m going to outline our experience. During the switch, we had to update our ingestion pipelines, and we only needed very minor configuration changes. First of all, we needed to have the correct Java libraries available to our system, as well as to set a few properties, and after that the pipelines just ran. Again, I won’t go through all the configuration parameters in detail here; I’m just illustrating that it was very few properties that we had to change, and after that we were able to write data. And in order to read the data, we needed an even smaller number of properties.
I also want to note that our ingestion pipelines are [inaudible] outside of Spark, and again, it was a very simple process to update all the dependencies needed. After we performed the modifications of our pipelines, we also tried to measure the impact on ingestion. Because we have a big monolithic system that performs a lot of additional operations, it was very hard to isolate the performance impact of Parquet modular encryption. Also, in our configuration, we had all the data encrypted, including all the columns and footers, and we chose the mode where the key material is stored externally, for easy future master key migration. We performed the measurements in our QA system, and we were not able to detect any real changes in our resource utilization: we didn’t see an impact on CPU, memory, or disk storage utilization, and the latency was the same. Also, admittedly on rather small test datasets, we were not able to see an impact on read performance either. For example, one rather large query that touches our two largest tables took 23.4 seconds without encryption and 25 seconds with encryption.
So just to summarize: the process of adopting Parquet modular encryption was very straightforward, and we didn’t see any major impact on the performance of our system.

Gidon Gershinsky

Gidon Gershinsky designs and builds Data Security solutions at Apple. He plays a leading role in the Apache Parquet community work on big data encryption and integrity verification technologies. He's ...

Tim Perelmutov

Before joining Apple 4.5 years ago, Tim worked as a software engineer at Fermi National Accelerator Laboratory as one of the key contributors to the dCache (https://www.dcache.org/about/) Mass Storage Sy...