FlowSpec—Apache Spark Pipelines in Production – Databricks

One of the key problems we have faced as a team working with big data and data science is moving machine learning to production without investing an undue amount of time in review, rework, and reimplementation. A clear separation of responsibilities between data scientists, platform and data engineers has also been tough to achieve, given how rapidly Spark evolves as a technology. Another management concern was the differing skills and capabilities of individual data scientists across departments. At Danske Bank, an interesting way to handle this situation has been the use of Spark pipelines. Data scientists in the organization use Spark pipelines to create uniformity in the features they generate and to streamline the modelling process. We began exploring whether we could also use them as a means of delivering code to production.

This talk focuses on how a simple prototype tool, FlowSpec, which took a couple of weeks to develop, helped reduce time to market for models, ensure data quality, create a fair and clear separation of duties, and offer a consolidated solution to recurring problems in the arduous process of moving ML models from different teams and departments in a large organization to production. Some unforeseen but interesting benefits of the approach were that we could easily visualize data flows for GDPR compliance projects, resolve dependencies between data flows automatically, and centralize performance best practices in a tool maintained by one team.
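
To make the idea of resolving dependencies between data flows more concrete, here is a minimal sketch using only the Python standard library. It is not FlowSpec's implementation, which this abstract does not describe; the flow names and the `execution_order` helper are hypothetical, and the sketch assumes that each flow simply declares which other flows it reads from.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical flow specification: each flow maps to the set of flows
# whose output it reads. The names are illustrative only.
flows = {
    "raw_transactions": set(),
    "customer_features": {"raw_transactions"},
    "risk_features": {"raw_transactions"},
    "churn_model_scores": {"customer_features", "risk_features"},
}


def execution_order(spec):
    """Return flow names ordered so that every flow follows its inputs.

    TopologicalSorter raises graphlib.CycleError if the flows depend on
    each other in a cycle, which would make the spec unrunnable.
    """
    return list(TopologicalSorter(spec).static_order())


order = execution_order(flows)
print(order)
```

With the dependencies declared as data, the same specification could in principle be used both to schedule flows and to draw them as a graph, which is the kind of visualization the compliance use case calls for.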

Key takeaways: 1. Our experience with a new approach to moving models rapidly from development to production (a matter of weeks). 2. A demo of a Python/Jupyter-based tool, which could potentially be made open source, for running and visualizing data flows that cater to the needs of production environments.

Session hashtag: #SAISML3

About Subramaniam Ramasubramanian

I am originally from Chennai, with a software engineering background and a master's degree in security and mobile computing. I have worked with data warehousing and building ETL systems since 2010. Since 2015 I have been nurturing the hacker data scientist image in my current organization, where I have also been an enthusiastic crusader for open source and Spark. I strongly believe in quick decision making and agile processes, and I have many tools and prototypes to my credit within the organization.