Session
PDF Document Ingestion Accelerator for GenAI Applications
Overview
Experience | In Person |
---|---|
Type | Breakout |
Track | Data Engineering and Streaming |
Industry | Financial Services |
Technologies | Apache Spark, Databricks Workflows |
Skill Level | Intermediate |
Duration | 40 min |
Databricks Financial Services customers in the GenAI space share a common use case: ingesting and processing unstructured documents (PDFs and images), then running downstream GenAI tasks such as entity extraction and RAG-based knowledge Q&A.
Customers with these use cases face the following pain points:
- The quality of the PDF/image documents varies since many older physical documents were scanned into electronic form
- The complexity of the PDF/image documents varies, and many contain tables or images with embedded information, which require slower Tesseract OCR
- Customers want to streamline post-processing for downstream workloads
In this talk, we will present an optimized Structured Streaming workflow for complex PDF ingestion. The key techniques include Apache Spark™ optimization, multi-threading, PDF object extraction, skew handling, and automatic retry logic.
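
To make two of the listed techniques concrete, the following is a minimal sketch of multi-threaded document parsing with automatic retries. It is not the workflow presented in the talk: the names `with_retry`, `process_batch`, and `parse_one` are hypothetical, the retry counts and delays are illustrative, and the actual parsing step (PDF object extraction or OCR) is left to whatever callable is passed in.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retry(func, max_attempts=3, base_delay=0.1):
    """Wrap func so that a failed call is retried with exponential backoff."""
    def wrapper(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Give up and re-raise once the last attempt has failed.
                if attempt == max_attempts:
                    raise
                time.sleep(base_delay * 2 ** (attempt - 1))
    return wrapper


def process_batch(paths, parse_one, max_workers=4, max_attempts=3, base_delay=0.1):
    """Parse documents concurrently.

    parse_one is a hypothetical callable that takes a path and returns the
    parsed content. Returns (results, failures), both keyed by path, so that
    one bad document does not fail the whole batch.
    """
    safe_parse = with_retry(parse_one, max_attempts, base_delay)
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {path: pool.submit(safe_parse, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:
                # Record the error so the document can be inspected or re-queued.
                failures[path] = repr(exc)
    return results, failures
```

In a Structured Streaming job, a function like `process_batch` could plausibly be called on the files of each micro-batch or partition; that wiring is an assumption here. Returning failures separately, rather than raising, keeps a single corrupt scan from stalling the rest of the batch.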