Prabhakar Gouda is a senior software engineer at Informatica with more than eight years of experience in software development and design. He is part of the Informatica core engineering team that builds Informatica's Big Data processing platforms, which handle Big Data processing and management using Apache Spark, Hadoop, YARN, HBase, HDFS, and Hive. He recently worked on a data visualization project that uses Spark-Jobserver to submit and manage Spark tasks on a Hadoop cluster.
June 24, 2020 05:00 PM PT
As you may already know, the open-source Spark-Jobserver offers a powerful platform for managing Spark jobs, JARs, and contexts, turning Spark into a much more convenient and easy-to-use service. Spark-Jobserver can keep a Spark context warmed up and readily available to accept new jobs. At Informatica we use Spark-Jobserver to solve a data-visualization use case: visualizing hierarchical data in a Big Data pipeline requires executing Spark jobs on Hadoop clusters. Reusing Spark contexts through the jobserver helped us achieve faster Spark task execution. We integrated Spark-Jobserver by using its REST APIs to create Spark contexts and manage their life cycle. Our product packages the customer's data pipeline logic into a JAR and submits it to Spark-Jobserver through the API.
Afterwards, Spark-Jobserver keeps the first context warmed up and submits subsequent jobs to the same context, which allows quicker execution because time is spent only on running the customer's domain logic and not on resource allocation or other boilerplate infrastructure work. Our production use cases require parallel job execution and job monitoring, which Spark-Jobserver readily provides through its smooth integration with Hadoop's Job History server. We were introduced to Spark-Jobserver through this conference and community, adopted it, and would like to pay it forward by talking about our journey adopting it in our data-integration product. The key takeaways will be the major configuration touch points for using Spark-Jobserver with YARN cluster mode and how we dealt with secure/SSL-enabled YARN clusters. We'll continue with running multiple Spark-Jobserver instances, managing jobs on the same or different clusters, concurrent job execution, and the APIs for resolving resources for dependencies.
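As a rough illustration of the context-reuse workflow described above, the following Python sketch builds the REST requests a client might issue: create a long-lived context once, upload the pipeline JAR, then submit several jobs to that same context. The endpoint paths follow the public Spark-Jobserver README; the host, the context name `viz-context`, the app name `pipeline`, and the class `com.example.PipelineJob` are hypothetical placeholders, not details from the talk. The sketch only constructs the requests and does not contact a server.

```python
from urllib.parse import urlencode, quote

# Assumed host; 8090 is the default Spark-Jobserver port.
BASE_URL = "http://localhost:8090"


def create_context(name, params=None):
    """Request that creates a long-lived (warm) Spark context."""
    url = f"{BASE_URL}/contexts/{quote(name)}"
    query = urlencode(params or {})
    return ("POST", f"{url}?{query}" if query else url)


def upload_binary(app_name):
    """Request that uploads the pipeline JAR under an app name.

    The JAR bytes go in the request body with the header
    Content-Type: application/java-archive.
    """
    return ("POST", f"{BASE_URL}/binaries/{quote(app_name)}")


def submit_job(app_name, class_path, context, sync=False):
    """Request that runs a job class from an uploaded JAR in an existing context."""
    query = urlencode({
        "appName": app_name,
        "classPath": class_path,
        "context": context,
        "sync": str(sync).lower(),
    })
    return ("POST", f"{BASE_URL}/jobs?{query}")


def job_status(job_id):
    """Request that polls the status of an asynchronously submitted job."""
    return ("GET", f"{BASE_URL}/jobs/{quote(job_id)}")


def delete_context(name):
    """Request that stops the context and releases its cluster resources."""
    return ("DELETE", f"{BASE_URL}/contexts/{quote(name)}")


if __name__ == "__main__":
    # Hypothetical names used purely for illustration.
    plan = [
        create_context("viz-context", {"num-cpu-cores": 2, "memory-per-node": "512m"}),
        upload_binary("pipeline"),
        # Both jobs target the same warm context, so neither pays context start-up cost.
        submit_job("pipeline", "com.example.PipelineJob", "viz-context"),
        submit_job("pipeline", "com.example.PipelineJob", "viz-context"),
        delete_context("viz-context"),
    ]
    for method, url in plan:
        print(method, url)
```

Only the first request pays the cost of allocating cluster resources; each later `submit_job` call names the existing context. Older Spark-Jobserver releases expose the upload endpoint as `/jars/<appName>` rather than `/binaries/<appName>`, so the path may need adjusting for a given deployment.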