Jason Hale

Data Engineer

Jason is a technology and sustainability enthusiast with a first-class Master's degree in physics from the University of Exeter. His career began on an analyst graduate scheme in the energy sector, but he quickly realised he was on the wrong path: he was working with Excel spreadsheets all day whilst teaching himself Python and SQL in his spare time. He left after five months and joined Mars Petcare as a Data Engineer. In his 16 months at Mars, he has worked on various projects, including ELT and API framework design and development, and the design and implementation of Gecko, the long-term strategy for CCPA compliance.

Past sessions

The growing number of consumer data privacy laws brings continuing challenges to data teams all over the world that collect, store, and use data protected by these laws. The data engineering team at Mars Petcare is no exception. To respond to these challenges more efficiently and accurately, the team has built Gecko: an efficient, auditable, and simple CCPA compliance ecosystem designed for Spark and Delta Lake.

Gecko has allowed us to simultaneously achieve the following benefits within our data platform:
- Automatically handle consumer deletion requests in a compliant manner.
- Increase the overall security of PII data in the Petcare Data Platform (PDP) Data Lake.
- Maintain the structure of non-PII data, so that it continues to provide analytical value and overall data integrity.
- Make PII data accessible when required.

These benefits have been achieved with a conceptually simple solution: row-level (client-level) encryption for all PII tables in our system, with the encryption keys stored in a single, highly secure location in our lake. By leveraging the power of Spark and Delta Lake, the Gecko ecosystem can carry out a full encryption of all personal data, automatically handle consumer data requests, and decrypt personal data when required for other engineering or analytical projects.
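As a rough illustration of the idea, the following Python sketch shows client-level encryption with keys held in one separate store. It is not Gecko's implementation: `KeyStore`, `PII_COLUMNS`, `encrypt_row`, and `decrypt_row` are hypothetical names, plain dictionaries stand in for Spark and Delta Lake tables, and the toy cipher built on HMAC-SHA256 is only a standard-library placeholder for a vetted cipher such as AES-GCM. Handling a deletion request by discarding the client's key is likewise an assumption about how such a design could work.

```python
import hashlib
import hmac
import secrets

PII_COLUMNS = ("email",)  # hypothetical list of columns labelled as PII


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (HMAC-SHA256 in counter mode). Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256)
        out += block.digest()
        counter += 1
    return bytes(out[:length])


def encrypt(key: bytes, plaintext: str) -> bytes:
    """Return a random nonce followed by the XOR-encrypted value."""
    data = plaintext.encode("utf-8")
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def decrypt(key: bytes, blob: bytes) -> str:
    """Reverse encrypt(): split off the nonce, then XOR with the keystream."""
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream)).decode("utf-8")


class KeyStore:
    """Hypothetical stand-in for the single, secure key location."""

    def __init__(self):
        self._keys = {}

    def key_for(self, client_id: str) -> bytes:
        # Create a key the first time a client is seen.
        if client_id not in self._keys:
            self._keys[client_id] = secrets.token_bytes(32)
        return self._keys[client_id]

    def get(self, client_id: str):
        return self._keys.get(client_id)

    def forget(self, client_id: str) -> None:
        # Assumed deletion mechanism: drop the client's key.
        self._keys.pop(client_id, None)


def encrypt_row(store: KeyStore, row: dict) -> dict:
    """Encrypt only the PII columns; non-PII columns are left untouched."""
    key = store.key_for(row["client_id"])
    return {k: encrypt(key, v) if k in PII_COLUMNS else v for k, v in row.items()}


def decrypt_row(store: KeyStore, row: dict) -> dict:
    """Decrypt PII columns, or return None for them if the key is gone."""
    key = store.get(row["client_id"])
    if key is None:
        return {k: None if k in PII_COLUMNS else v for k, v in row.items()}
    return {k: decrypt(key, v) if k in PII_COLUMNS else v for k, v in row.items()}
```

In this sketch, calling `store.forget("c1")` leaves every stored row for client `c1` in place, so non-PII columns keep their structure, but the encrypted PII values can no longer be read.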

The process has the added benefit of generating a large labelled training dataset containing all PII in the PDP, for future use in designing a machine learning model for automatic PII detection. Such a tool would let us remove the risk of human error when labelling PII on ingestion, and would enable PII removal from free-text fields.

This presentation will share:
- How the solution can automate privacy rights requests and enhance platform security.
- How Spark and Delta Lake have been leveraged in these applications.
- Why these technologies have been essential in achieving the necessary requirements.

Speakers: Jason Hale and Daniel Harrington