Sailee Jain - Databricks

Sailee Jain

Senior Software Engineer, Informatica

Sailee is a Senior Software Engineer on Informatica’s Big Data Management team, focused on data integration solutions in Hadoop environments. She recently worked on a data visualization project that allows previewing hierarchical data as it flows through the transformations in a Spark data pipeline. Sailee received her Master’s in Computer Science from the Indian Institute of Technology Bombay. She is currently working on Informatica’s Elastic Cloud data integration project.

UPCOMING SESSIONS

Faster Data Integration Pipeline Execution using Spark-Jobserver

Summit 2020

As you may already know, the open-source Spark Job Server offers a powerful platform for managing Spark jobs, jars, and contexts, turning Spark into a much more convenient and easy-to-use service. Spark-Jobserver can keep a Spark context warmed up and readily available to accept new jobs. At Informatica we are leveraging Spark-Jobserver to solve the data-visualization use case. Data visualization of hierarchical data in a big data pipeline requires executing Spark jobs on Hadoop clusters. Reusing the Spark context through the jobserver helped us achieve faster Spark task execution. We integrated Spark-Jobserver by using its REST APIs to create Spark contexts and manage their life cycle. Our product packages the customer's data pipeline logic into a JAR and submits it to Spark-Jobserver using the API.

Afterwards, Spark-Jobserver keeps the first context warmed up and submits subsequent jobs to the same context. This allows quicker execution, because the time is spent only on running the customer's domain logic and not on resource allocation or other boilerplate infrastructure work. Our production use cases require parallel job execution and job monitoring, which Spark-Jobserver readily provides thanks to its smooth integration with Hadoop's Job History server. We were introduced to Spark-Jobserver through this conference and community and adopted it, and we would like to pay it forward by talking about our journey adopting it in our data-integration product. The key takeaways will be the major configuration touch points for using Spark-Jobserver in YARN cluster mode and how we dealt with secure/SSL-enabled YARN clusters. We'll continue with multiple Spark-Jobserver instances, managing jobs on the same or different clusters, concurrent job execution, and the APIs for resolving resource dependencies.
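
The REST-based flow described in the abstract (create a long-lived context, upload the pipeline JAR, then submit jobs to the warm context) can be sketched roughly as follows. This is only an illustration, not Informatica's actual integration: the host and port, the context name, the application name, and the job class are hypothetical, and the endpoint paths follow the public Spark-Jobserver README, where older releases use /jars instead of /binaries.

```python
import urllib.request
from urllib.parse import quote, urlencode

# Assumption: a jobserver on its commonly used default port.
BASE_URL = "http://localhost:8090"


def context_url(base, name, params=None):
    """URL for POST /contexts/<name>, which creates a long-lived context."""
    query = urlencode(params or {})
    url = f"{base}/contexts/{quote(name)}"
    return f"{url}?{query}" if query else url


def binary_url(base, app_name):
    """URL for POST /binaries/<appName>, which uploads the pipeline JAR."""
    return f"{base}/binaries/{quote(app_name)}"


def job_url(base, app_name, class_path, context, sync=False):
    """URL for POST /jobs, which runs a job class inside an existing context."""
    query = urlencode({
        "appName": app_name,
        "classPath": class_path,
        "context": context,
        "sync": str(sync).lower(),
    })
    return f"{base}/jobs?{query}"


def post(url, data=b"", content_type=None):
    """Send a POST request and return the response body as text."""
    headers = {"Content-Type": content_type} if content_type else {}
    request = urllib.request.Request(url, data=data, headers=headers, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def submit_pipeline(jar_path, base=BASE_URL):
    """Hypothetical end-to-end flow; all names below are illustrative only."""
    # 1. Create the context once so later jobs skip resource allocation.
    post(context_url(base, "preview-ctx",
                     {"num-cpu-cores": 2, "memory-per-node": "512m"}))
    # 2. Upload the JAR that packages the pipeline logic.
    with open(jar_path, "rb") as jar:
        post(binary_url(base, "pipeline"), jar.read(),
             content_type="application/java-archive")
    # 3. Submit a job to the warm context; repeat this step for later jobs.
    return post(job_url(base, "pipeline", "com.example.PipelineJob",
                        "preview-ctx", sync=True))
```

Only the third step needs to be repeated for subsequent jobs, which is where the reuse of the warm context pays off. The URL builders are kept separate from the network call so they can be checked without a running jobserver.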

PAST SESSIONS