Session

PDF Document Ingestion Accelerator for GenAI Applications

Overview

Experience	In Person
Type	Breakout
Track	Data Engineering and Streaming
Industry	Financial Services
Technologies	Apache Spark, Databricks Workflows
Skill Level	Intermediate
Duration	40 min

Databricks Financial Services customers building GenAI applications share a common use case: ingesting and processing unstructured documents (PDFs and images), then performing downstream GenAI tasks such as entity extraction and RAG-based knowledge Q&A.

Customers face the following pain points with these use cases:

The quality of the PDF/image documents varies, since many older physical documents were scanned into electronic form
The complexity of the PDF/image documents varies, and many contain tables or images with embedded information, which require slower Tesseract OCR
Customers want to streamline post-processing for downstream workloads

In this talk, we will present an optimized Structured Streaming workflow for complex PDF ingestion. The key techniques include Apache Spark™ optimization, multi-threading, PDF object extraction, skew handling, and automatic retry logic.
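
As a rough illustration of three of the techniques named above (skew handling, multi-threading, and automatic retries), the sketch below uses only the Python standard library. It is not the speakers' implementation: the function names (`with_retry`, `balance_by_pages`, `process_all`), the stand-in parser `fake_parse`, the file names, and the page counts are all hypothetical. In a real pipeline, logic of this shape would presumably wrap the actual PDF parsing or OCR call inside a Spark task.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def with_retry(fn, max_attempts=3, base_delay=0.01):
    """Wrap fn so that failures are retried with exponential backoff."""
    def wrapped(*args):
        for attempt in range(1, max_attempts + 1):
            try:
                return fn(*args)
            except Exception:
                if attempt == max_attempts:
                    raise  # out of attempts: surface the last error
                time.sleep(base_delay * 2 ** (attempt - 1))
    return wrapped


def balance_by_pages(docs, num_buckets):
    """Greedy skew handling: place the largest documents first, each into
    the bucket that currently holds the fewest pages."""
    buckets = [[] for _ in range(num_buckets)]
    loads = [0] * num_buckets
    for name, pages in sorted(docs.items(), key=lambda kv: -kv[1]):
        i = loads.index(min(loads))
        buckets[i].append(name)
        loads[i] += pages
    return buckets, loads


def process_all(names, parse_fn, max_workers=4):
    """Parse documents concurrently; record failures instead of aborting."""
    results, failures = {}, {}
    safe_parse = with_retry(parse_fn)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(safe_parse, name) for name in names}
        for name, fut in futures.items():
            try:
                results[name] = fut.result()
            except Exception as exc:
                failures[name] = repr(exc)
    return results, failures


# Hypothetical stand-in for a real PDF/OCR call: "flaky.pdf" fails twice
# before succeeding, and "corrupt.pdf" always fails.
_calls = {}
_lock = threading.Lock()


def fake_parse(name):
    with _lock:
        _calls[name] = _calls.get(name, 0) + 1
        n = _calls[name]
    if name == "flaky.pdf" and n < 3:
        raise IOError("transient failure")
    if name == "corrupt.pdf":
        raise ValueError("unreadable document")
    return f"text of {name}"


if __name__ == "__main__":
    # Made-up page counts, used only to show the balancing step.
    page_counts = {"a.pdf": 120, "b.pdf": 10, "c.pdf": 15, "d.pdf": 100, "e.pdf": 5}
    print(balance_by_pages(page_counts, num_buckets=2))
    print(process_all(["ok.pdf", "flaky.pdf", "corrupt.pdf"], fake_parse))
```

Balancing by page count rather than by file count is one simple way to keep a few very large documents from dominating a single worker. Collecting failures per document, rather than raising, lets the rest of a batch finish while the failed items are handled separately.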

Session Speakers

Jas Bali

Lead SSA
Databricks

Qian Yu

Sr. Specialized Solutions Engineer
Databricks