I am currently a Senior Enterprise Architect and former member of Elsevier Labs and Advanced Technology Groups, the R&D arms of Reed Elsevier. I now work on projects in big data, natural language processing, and search engine design and application. Earlier in my career, I co-developed a 3D brain simulator that was featured on the NPR show All Things Considered, published an IETF Draft Standard for firewall content filtering, and co-authored a patent for work at Alta Vista. In academia, I have been a teacher/lecturer at universities in the USA and, as a Peace Corps Volunteer, in Tunisia.
Accessing AWS services from a Spark program requires authentication credentials that, when improperly managed, seriously threaten system security. Spark clusters engage tens, hundreds, or even thousands of machines, and managing authentication credentials across clusters can be very complicated. This complexity increases for systems that scale dynamically and for systems that use opportunistic scheduling strategies. For example, how would you disable and then re-issue resource credentials on a cluster of tens or hundreds of machines, all without restarting your application? How would you manage credentials embedded within an application or a sealed image without re-issuing the application? Consider the following very real scenarios:

1. Multiple users access your database and you want to periodically rotate credentials.
2. Your compiled application requires AWS credentials to post events to AWS topics and to read events from AWS queues. The credentials are either embedded within the application or are read from environment variables.
3. One of your AWS EC2 instances within a Spark cluster has come under attack and you are concerned that your security credentials might have been compromised.
4. You want to provide different levels of access to different users.
5. You want security credentials to automatically expire after a certain period.

This paper explains these and other common security challenges and then shows safe techniques that reduce or eliminate security risk. While the solutions described are specific to Spark running in AWS, the underlying principles apply to big data applications on all platforms.

Session hashtag: #EUent2