We’ll tackle the problem of running streaming jobs from another perspective using Databricks Delta Lake, while examining some of the issues we faced at Tubi when running regular structured streaming. We’ll begin with a quick overview of why we transitioned our datasets from Parquet files to Delta Lake and the problems that solved for our streaming jobs. We will then explore techniques for maximizing cluster utilization by submitting multiple streaming jobs from the driver to run in parallel using Scala parallel collections, and discuss how to write and implement idempotent tasks that can safely be parallelized. To conclude, we will cover an advanced topic: running a parallel streaming backfill job and the nuances of handling failure and recovery. Demos using Databricks notebooks will be shown throughout the presentation.
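As a rough illustration of the parallel-submission idea, the sketch below fans out one job per table from the driver using Scala parallel collections (`.par`, available on standard collections in Scala 2.12). The table names and the `runJob` body are hypothetical stand-ins: in a real Spark application each task would start a structured streaming query against its Delta table and block on it, whereas here the body is a placeholder so the example is self-contained.

```scala
object ParallelJobs {
  // Hypothetical table list -- placeholders, not actual Tubi datasets
  val tables = Seq("events", "impressions", "sessions")

  // Stand-in for submitting one streaming job; in practice this would
  // call spark.readStream ... .start() on a Delta table and await it.
  // Keeping each task idempotent makes retries after failure safe.
  def runJob(table: String): String =
    s"started:$table"

  def main(args: Array[String]): Unit = {
    // .par converts the Seq to a ParSeq, so map executes runJob on a
    // thread pool: multiple jobs are submitted concurrently from the driver.
    val results = tables.par.map(runJob).seq
    results.foreach(println)
  }
}
```

Because each call only submits a query and returns, the driver's thread pool is enough to keep several streams running at once; cluster resources, not driver threads, become the limiting factor.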
I got my Master’s degree in Computer Science from Indiana University, and since then I’ve worked at tech startups in the Bay Area in the fields of finance, retail, and entertainment. Real-time projects excite me the most because it’s fun to see and interact with systems that respond to a given signal.