Marcelo is a Software Engineer at Cloudera and a contributor to the Apache Spark project.
Apache Hive has become the de facto standard for SQL on big data in the Hadoop ecosystem. With its open architecture and backend neutrality, Hive queries can run on MapReduce and Tez. Meanwhile, Apache Spark, an open-source data analytics cluster computing framework, has gained significant momentum recently. Powering Hive with Spark, that is, introducing Spark as a new execution engine for Hive, has many benefits for both Spark users and Hive users. Hive on Spark (HIVE-7292) is probably the most watched project in Hive, with 130+ watchers. The effort has attracted developers from both communities, around the globe, and from major companies such as Intel, IBM, Cloudera, and MapR. This presentation covers the motivation, design principles, and architecture of the approach, with an emphasis on the technical challenges posed to both Spark and Hive, such as YARN integration, resource scaling, and user session management, as well as the approaches we take and the tradeoffs we make to overcome these challenges. The presentation concludes with a status update on the project, followed by a live demo.
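For orientation, the user-facing side of Hive on Spark is a single engine switch in Hive's configuration. The sketch below is illustrative, not the talk's material: `hive.execution.engine` and `spark.master` are real Hive/Spark properties, while the table name and query are hypothetical placeholders.

```shell
# Hedged sketch: running a Hive query on the Spark engine (HIVE-7292).
# hive.execution.engine accepts mr, tez, or spark; values below are
# assumptions for a YARN-backed cluster, and sample_table is hypothetical.
hive -e "
set hive.execution.engine=spark;
set spark.master=yarn;
SELECT count(*) FROM sample_table;
"
```

Switching back to MapReduce or Tez is the same one-property change, which is what backend neutrality buys Hive users.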
YARN is becoming a popular way to deploy Spark applications. In this presentation we’ll explore some of the existing features that make YARN a popular choice, and talk about some of the future work to make it even easier to deploy Spark applications on YARN. This includes enhancements to dynamic allocation, such as better support for data locality, all-encompassing security, and support for long-running applications in secure environments, which is particularly important for Spark Streaming.
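As context for the dynamic allocation discussion, here is a minimal sketch of enabling it for a Spark-on-YARN submission. The configuration keys are Spark's standard dynamic allocation properties; the executor bounds and application file are placeholder assumptions.

```shell
# Illustrative spark-submit with dynamic allocation on YARN.
# Dynamic allocation requires the external shuffle service to be
# running on the YARN NodeManagers so executors can be released
# without losing shuffle data. my_app.py is a placeholder.
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  my_app.py
```

With this in place, Spark grows and shrinks the executor count with the workload, which is the mechanism the locality enhancements mentioned above build on.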
With Spark being used for more and more production workloads with stringent security requirements, fully locking down Spark applications has become critical. In this talk you will learn about the different aspects of securing your Spark application. First, we will describe how Spark uses Kerberos for application authentication. Next, we will discuss the protection of sensitive data through on-disk and on-the-wire encryption. Finally, when integrating with external datastores, authorization becomes important to control who can access what data. We will conclude by discussing open challenges and future work to improve overall security.
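The three aspects above (Kerberos authentication, wire encryption, disk encryption) map to concrete Spark options. The sketch below uses real Spark configuration keys and `spark-submit` flags, but the principal, keytab path, and application jar are placeholder assumptions for a secure YARN cluster.

```shell
# Hedged sketch of the security settings discussed above.
# --principal/--keytab let YARN renew Kerberos credentials for
# long-running applications; the remaining properties enable
# authentication and encryption of Spark's internal traffic.
# All identifiers (principal, paths, app.jar) are placeholders.
spark-submit \
  --master yarn \
  --principal analyst@EXAMPLE.COM \
  --keytab /etc/security/keytabs/analyst.keytab \
  --conf spark.authenticate=true \
  --conf spark.network.crypto.enabled=true \
  --conf spark.io.encryption.enabled=true \
  app.jar
```

Note that these settings cover Spark's own channels; authorization against external datastores is enforced by those systems (for example, via their own ACLs), which is part of what the talk's final section addresses.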