Hemshankar Sahu

Principal Software Engineer, Informatica

Hemshankar Sahu is a Principal Software Engineer at Informatica. He is building a next-gen cloud platform that integrates projects from various teams, minimizing the time required for collaboration and reducing overall production cost. His primary focus is on integrating AI/ML projects developed by the Data Science team with Informatica’s ETL tools and services used by the Data Engineering team.

Past sessions

Summit Europe 2020 Simplifying AI integration on Apache Spark

November 18, 2020 04:00 PM PT

Spark is an ETL and data processing engine especially suited for big data. Most of the time, an organization has different teams working with different languages, frameworks and libraries, which need to be integrated into ETL pipelines or used for general data processing. For example, a Spark ETL job may be written in Scala by the data engineering team, but there is a need to integrate a machine learning solution written in Python/R by the data science team. These kinds of solutions are not straightforward to integrate with the Spark engine, and they require a great deal of collaboration between teams, which increases overall project time and cost. Furthermore, these solutions keep changing and upgrading over time as they adopt the latest versions of the technologies and improved designs and implementations, especially in the machine learning domain, where models and algorithms keep improving with new data and new approaches. As a result, there is significant downtime involved in integrating these upgraded versions.

In this talk we will discuss how Informatica integrates AI solutions into data processing pipelines executing on top of Spark, along with the following major features:
1. The Data Science team can easily share AI/ML solutions created using any library, language or framework.
2. A shared AI/ML solution can be easily consumed in the Spark pipeline.
3. Using Informatica products, customers can create the Spark pipeline with the selected solution(s) by drag and drop.
4. Various teams can continuously integrate and deploy (CI/CD) different solutions with minimal downtime.

In conclusion, we will understand how different teams (such as data scientists and data engineers) can integrate their work, thereby reducing the time and cost consumed in collaboration.

We will also understand how CI/CD is achieved on Spark with minimal downtime while integrating various projects, especially AI/ML projects, using Informatica products.

Thus, by using features such as drag-and-drop creation of Spark pipelines, easier collaboration between teams, and CI/CD, organizations can drastically reduce overall project completion time and cost.

Speaker: Hemshankar Sahu