Multi-Stage Field Extraction of Financial Documents with OCR and Compact Vision-Language Models
- URL: http://arxiv.org/abs/2510.23066v1
- Date: Mon, 27 Oct 2025 06:56:08 GMT
- Title: Multi-Stage Field Extraction of Financial Documents with OCR and Compact Vision-Language Models
- Authors: Yichao Jin, Yushuo Wang, Qishuai Zhong, Kent Chiu Jin-Chun, Kenneth Zhu Ke, Donald MacDonald
- Abstract summary: Financial documents are essential sources of information for regulators, auditors, and financial institutions. These documents tend to be heterogeneous, mixing narratives, tables, figures, and multilingual content within the same report. We propose a multistage pipeline that leverages traditional image processing models and OCR extraction, together with compact VLMs for structured field extraction.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Financial documents are essential sources of information for regulators, auditors, and financial institutions, particularly for assessing the wealth and compliance of Small and Medium-sized Businesses (SMBs). However, SMB documents are often difficult to parse. They are rarely born digital and instead are distributed as scanned images that are not machine-readable. The scans themselves are low in resolution, affected by skew or rotation, and often contain noisy backgrounds. These documents also tend to be heterogeneous, mixing narratives, tables, figures, and multilingual content within the same report. Such characteristics pose major challenges for automated information extraction, especially when relying on end-to-end large Vision-Language Models (VLMs), which are computationally expensive, sensitive to noise, and slow when applied to files with hundreds of pages. We propose a multistage pipeline that leverages traditional image processing models and OCR extraction, together with compact VLMs, for structured field extraction from large-scale financial documents. Our approach begins with image pre-processing, including segmentation, orientation detection, and size normalization. Multilingual OCR is then applied to recover page-level text. After analyzing the text, pages are retrieved to form coherent sections. Finally, compact VLMs operate within these narrowed-down scopes to extract structured financial indicators. Our approach is evaluated on an internal corpus of multilingual, scanned financial documents. The results demonstrate that compact VLMs, together with a multistage pipeline, achieve 8.8 times higher field-level accuracy relative to directly feeding the whole document into large VLMs, at only 0.7 percent of the GPU cost and with 92.6 percent lower end-to-end service latency.
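The abstract's four stages (pre-processing, OCR, page retrieval, narrow-scope extraction) can be sketched as a toy pipeline. Every stage below is a simplified stand-in of my own devising, not the paper's implementation: a real system would use an image library for deskewing and normalization, an OCR engine for text recovery, and a compact VLM for extraction; here text normalization, keyword retrieval, and a regex play those roles.

```python
import re
from dataclasses import dataclass

@dataclass
class Page:
    number: int
    text: str  # stand-in for the OCR output of a scanned page

def preprocess(pages):
    # Stand-in for segmentation, orientation detection, and size
    # normalization: here we only normalize whitespace in the text.
    return [Page(p.number, " ".join(p.text.split())) for p in pages]

def retrieve_section(pages, keywords):
    # Page-level retrieval: keep only pages mentioning a keyword, narrowing
    # the scope handed to the expensive extraction model.
    return [p for p in pages if any(k in p.text.lower() for k in keywords)]

def extract_fields(pages):
    # Stand-in for the compact VLM: a regex pulling "label: amount" pairs.
    fields = {}
    for p in pages:
        for label, value in re.findall(
                r"(revenue|net income)\s*[:=]\s*([\d,]+)", p.text.lower()):
            fields[label] = int(value.replace(",", ""))
    return fields

pages = [
    Page(1, "Chairman's   letter to shareholders."),
    Page(2, "Income statement. Revenue: 1,250,000  Net income: 98,000"),
]
result = extract_fields(retrieve_section(preprocess(pages), ["revenue", "income"]))
print(result)  # → {'revenue': 1250000, 'net income': 98000}
```

The structural point survives the simplification: only the retrieved pages ever reach the final extraction stage, which is what drives the cost and latency savings the abstract reports.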
Related papers
- Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding [102.88996030431662]
We propose a training-free and highly efficient acceleration method for document parsing tasks. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens. We demonstrate the effectiveness of our approach on the general-purpose OmniDocBench.
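The speculative-decoding idea this summary relies on can be illustrated with a toy: a cheap draft model proposes a batch of future tokens, the target model verifies the whole batch at once, and the longest agreeing prefix is accepted. Both "models" below are deterministic toys of my own construction, not the paper's parser or verifier.

```python
TARGET = list("the quick brown fox")

def draft_model(pos, k):
    # Cheap draft: guesses up to k tokens, but errs at every 5th position.
    return [TARGET[i] if (i + 1) % 5 else "?"
            for i in range(pos, min(pos + k, len(TARGET)))]

def verify(pos, proposed):
    # Target model checks the batch in one pass; accept the matching prefix.
    accepted = []
    for i, tok in enumerate(proposed):
        if tok != TARGET[pos + i]:
            break
        accepted.append(tok)
    return accepted

out, pos, steps = [], 0, 0
while pos < len(TARGET):
    batch = verify(pos, draft_model(pos, 4))
    if not batch:                  # draft wrong immediately:
        batch = [TARGET[pos]]      # fall back to one target-model token
    out += batch
    pos += len(batch)
    steps += 1

print("".join(out), steps)  # → "the quick brown fox" in 7 steps, not 19
```

Because each verification step can accept several tokens, the target model is invoked far fewer times than the sequence length, which is the source of the speedup.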
arXiv Detail & Related papers (2026-02-13T14:22:10Z)
- Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
- UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases. UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG, built from 70k real-world PDF pages. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
arXiv Detail & Related papers (2025-10-04T04:30:13Z)
- Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation [47.714317480436215]
PREMIR is a simple framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions (preQs) before retrieval. Experiments show that PREMIR achieves state-of-the-art performance on out-of-distribution benchmarks, including closed-domain and multilingual settings.
arXiv Detail & Related papers (2025-08-23T16:14:41Z)
- Digitization of Document and Information Extraction using OCR [0.0]
This document presents a framework for text extraction that merges Optical Character Recognition (OCR) techniques with Large Language Models (LLMs). Scanned files are processed using OCR engines, while digital files are interpreted through layout-aware libraries. The extracted raw text is then analyzed by an LLM to identify key-value pairs and resolve ambiguities.
arXiv Detail & Related papers (2025-06-11T16:03:01Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
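The summary does not specify how MME performs its compression, but the general "matryoshka" idea of nested granularities can be loosely sketched: the same token sequence is pooled to progressively fewer tokens, so a retriever can trade accuracy for cost by choosing how coarse a view to keep. This is an illustration of the concept only, not the MME architecture; tokens are plain floats rather than vectors.

```python
def mean_pool(tokens, out_len):
    # Pool a list of token values down to out_len tokens by chunked averaging.
    chunk = len(tokens) / out_len
    pooled = []
    for j in range(out_len):
        lo, hi = round(j * chunk), round((j + 1) * chunk)
        group = tokens[lo:hi]
        pooled.append(sum(group) / len(group))
    return pooled

visual_tokens = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
# Nested granularities: 8 tokens, then 4, 2, and finally a single token.
granularities = {k: mean_pool(visual_tokens, k) for k in (8, 4, 2, 1)}
print(granularities[1])  # → [4.5], the coarsest one-token view
```

Each coarser level is consistent with the finer one it summarizes, which is what makes the granularity choice a pure cost/accuracy knob at query time.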
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy [50.78228433498211]
CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and released for the first time. We evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation, and hallucination of repetition.
arXiv Detail & Related papers (2024-12-03T07:03:25Z)
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Multimodal document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved performance on this task. We introduce PDF-WuKong, a multimodal large language model (MLLM) designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
- Arctic-TILT. Business Document Understanding at Sub-Billion Scale [1.2286461468814107]
We introduce Arctic-TILT, which achieves accuracy on par with models 1000 times its size on these use cases.
It can be fine-tuned and deployed on a single 24GB GPU, lowering operational costs while processing Visually Rich Documents with up to 400k tokens.
The model establishes state-of-the-art results on seven diverse Document Understanding benchmarks, and provides reliable confidence scores and quick inference.
arXiv Detail & Related papers (2024-08-08T17:59:46Z)
- Drilling Down into the Discourse Structure with LLMs for Long Document Question Answering [5.022057415488129]
We propose a suite of techniques that exploit the discourse structure commonly found in documents.
We show how our approach can be combined with a self-ask reasoning agent to achieve the best zero-shot performance in complex multi-hop question answering.
arXiv Detail & Related papers (2023-11-22T18:22:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.