Efficient Spark Analytics on Encrypted Data


Enterprises and non-profit organizations often work with sensitive business or personal information that must be stored in encrypted form due to corporate confidentiality requirements, the new GDPR regulations, and other reasons. Unfortunately, straightforward encryption doesn’t work well for modern columnar data formats, such as Apache Parquet, that Spark leverages to accelerate data ingest and processing. When Parquet files are bulk-encrypted in storage, their internal modules can’t be extracted individually, leading to a loss of column / row filtering capabilities and a significant slowdown of Spark workloads.

Existing solutions suffer from either performance or security drawbacks. We work with the Apache Parquet community on a new modular encryption mechanism that enables full columnar projection and predicate pushdown (filtering) functionality on encrypted data in any storage system. Besides confidentiality, the mechanism supports data authentication, where the reader can verify that a file has not been tampered with or replaced with a wrong version. Different columns can be encrypted with different keys, allowing for fine-grained access control.
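As a rough illustration of per-column keys, the sketch below builds the settings that a Spark job might pass to Parquet. It assumes the property names documented for the columnar encryption support that later shipped in Spark 3.2 / Parquet 1.12, which may differ from the prototype shown in this talk. The helper functions, key IDs, key values, and column names are all hypothetical, and the in-memory mock KMS is meant for testing only.

```python
# Property names follow the Parquet columnar encryption support documented for
# Spark 3.2+ / Parquet 1.12+; they are an assumption here, not part of the talk.
CRYPTO_FACTORY = "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory"
MOCK_KMS = "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS"  # test-only KMS


def column_keys_option(keys_to_columns):
    """Format {"keyId": ["col", ...]} as "keyId:col,col;keyId2:col"."""
    return ";".join(
        f"{key_id}:{','.join(cols)}" for key_id, cols in keys_to_columns.items()
    )


def hadoop_encryption_conf(master_keys):
    """Session-level properties: crypto factory, mock KMS, and demo master keys."""
    return {
        "parquet.crypto.factory.class": CRYPTO_FACTORY,
        "parquet.encryption.kms.client.class": MOCK_KMS,
        "parquet.encryption.key.list": ",".join(
            f"{key_id}:{b64}" for key_id, b64 in master_keys.items()
        ),
    }


def write_encrypted(spark, df, path, master_keys, keys_to_columns, footer_key):
    """Write df as Parquet with per-column keys (needs a live SparkSession)."""
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    for name, value in hadoop_encryption_conf(master_keys).items():
        hconf.set(name, value)
    (df.write
       .option("parquet.encryption.column.keys", column_keys_option(keys_to_columns))
       .option("parquet.encryption.footer.key", footer_key)
       .parquet(path))


# Hypothetical demo values: key IDs, base64 keys, and column names are made up.
DEMO_KEYS = {"keyA": "AAECAwQFBgcICQoLDA0ODw==", "keyB": "AAECAAECAAECAAECAAECAA=="}
DEMO_COLUMNS = {"keyA": ["location", "speed"], "keyB": ["driver_id"]}
```

With a running session, a call such as `write_encrypted(spark, df, path, DEMO_KEYS, DEMO_COLUMNS, "keyB")` would encrypt the location and speed columns under one key and the driver identity under another. A reader configured with the same properties can then select individual columns with the usual `spark.read.parquet(path).select(...)` call.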

In this talk, I will demonstrate Spark integration with the Parquet modular encryption mechanism, running efficient analytics directly on encrypted data. The demonstration scenarios are derived from use cases in our joint research project with a number of European companies that work with sensitive data such as connected car messages (location, speed, driver identity, etc.). I will describe the encryption mechanism and the observed performance implications of encrypting and decrypting data in Spark SQL workloads.

Session hashtag: #SAISDev14



Gidon Gershinsky
About Gidon Gershinsky

Gidon Gershinsky designs and builds data security solutions at Apple. He plays a leading role in the Apache Parquet community's work on big data encryption and integrity verification technologies. He earned a PhD at the Weizmann Institute of Science in Israel, and was a post-doctoral fellow at Columbia University in New York City.