Shasidhar is part of the Resident Solutions Architects team at Databricks. He is an expert in designing and building batch and streaming applications at scale using Apache Spark. At Databricks he works directly with customers to build, deploy, and manage end-to-end Spark pipelines in production, and helps guide them toward Spark best practices. Shasidhar started his Spark journey back in 2014 in Bangalore, then worked as an independent consultant for a couple of years before joining Databricks in 2018.
May 26, 2021 11:30 AM PT
Structured Streaming Internals
With the Lakehouse as the future of data architecture, Delta becomes the de facto storage format for data pipelines. By using Delta to build curated data lakes, users achieve efficiency and reliability end-to-end. Curated data lakes involve multiple hops in the end-to-end pipeline, executed on a regular schedule (often daily) depending on need. As data travels through each hop, its quality improves and it becomes suitable for end-user consumption. At the same time, real-time capabilities are key for any business, and Delta's seamless integration with Structured Streaming makes them easy to achieve. Overall, Delta Lake as a streaming source is a marriage made in heaven for various reasons, and we are already seeing adoption rise among our users.
In this talk, we will discuss the functional components of Structured Streaming with Delta as a streaming source. We will deep dive into Query Progress Logs (QPL) and their significance for operating streams in production, show how to track the progress of any streaming job and map it back to the source Delta table using QPL, and examine exactly what gets persisted in the checkpoint directory. Finally, we will map the contents of the checkpoint directory to the QPL metrics and explain their significance for Delta streams.
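To make the QPL-to-Delta mapping concrete, here is a minimal sketch that parses a progress entry (the JSON a streaming query reports per micro-batch) and pulls out the Delta table versions the batch covered. The sample record is trimmed and its Delta offset fields (`reservoirVersion`, `index`, `isStartingVersion`) are an assumption based on typical output, not a guaranteed schema.

```python
import json

# Trimmed example of a Query Progress Log (QPL) entry, as a stream reading
# a Delta table might report it. Field names are assumed/illustrative.
sample_qpl = json.dumps({
    "batchId": 42,
    "numInputRows": 1000,
    "sources": [{
        "description": "DeltaSource[dbfs:/delta/events]",
        "startOffset": {"reservoirVersion": 10, "index": 3, "isStartingVersion": False},
        "endOffset": {"reservoirVersion": 12, "index": -1, "isStartingVersion": False},
        "numInputRows": 1000,
    }],
})

def delta_progress(qpl_json: str) -> dict:
    """Extract which Delta table versions a micro-batch covered."""
    progress = json.loads(qpl_json)
    src = progress["sources"][0]
    return {
        "batchId": progress["batchId"],
        "fromVersion": src["startOffset"]["reservoirVersion"],
        "toVersion": src["endOffset"]["reservoirVersion"],
        "rows": src["numInputRows"],
    }

print(delta_progress(sample_qpl))
# {'batchId': 42, 'fromVersion': 10, 'toVersion': 12, 'rows': 1000}
```

The same offset JSON is what gets persisted under the checkpoint directory's `offsets/` files, which is why QPL metrics and checkpoint contents can be cross-referenced batch by batch.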
November 18, 2020 04:00 PM PT
Building a curated data lake on real-time data is an emerging data warehouse pattern with Delta. In the real world, however, we often face dynamically changing schemas, which are a big challenge to incorporate without downtime.
In this presentation we will show how we built a robust streaming ETL pipeline that handles changing schemas and unseen event types with zero downtime. The pipeline can infer changed schemas, adjust the underlying tables, and create new tables and ingestion streams when it detects a new event type. We will show in detail how to infer schemas on the fly and how to track and store them when you don't have the luxury of a schema registry in the system.
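The core idea above, inferring a schema from incoming events and merging newly seen fields into a tracked schema, can be sketched in a few lines. This is a hedged, pure-Python illustration over flat JSON events; the function names and the string-based type representation are ours, not the talk's actual pipeline.

```python
import json

def infer_schema(event: dict) -> dict:
    """Infer a flat schema for one event: field name -> Python type name."""
    return {field: type(value).__name__ for field, value in event.items()}

def merge_schema(current: dict, incoming: dict) -> tuple:
    """Merge newly seen fields into the tracked schema.

    Returns (merged_schema, changed) so the caller can decide whether the
    underlying table needs to be adjusted before ingesting the batch.
    """
    new_fields = {f: t for f, t in incoming.items() if f not in current}
    return {**current, **new_fields}, bool(new_fields)

# Tracked schema so far, and an incoming event with an unseen field.
tracked = {"user_id": "str", "ts": "int"}
event = json.loads('{"user_id": "u1", "ts": 1700000000, "country": "HU"}')

tracked, changed = merge_schema(tracked, infer_schema(event))
# changed is True and tracked now also contains "country": "str"
```

In a real pipeline the merged schema would be persisted (for example to a Delta table acting as a poor man's registry) and used to evolve the target table before the write, rather than kept in a local variable.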
With potentially hundreds of streams, how we deploy them and make them operational on Databricks matters. We also address this aspect of the real-time data pipeline and share production experience on how the approach performs, in both cost and performance, for ever-growing ingestion loads from data providers.
Speakers: Mate Gulyas and Shasidhar Eranti