A Hybrid Architecture for Multi-Stage Claim Document Understanding: Combining Vision-Language Models and Machine Learning for Real-Time Processing
- URL: http://arxiv.org/abs/2601.01897v1
- Date: Mon, 05 Jan 2026 08:40:44 GMT
- Title: A Hybrid Architecture for Multi-Stage Claim Document Understanding: Combining Vision-Language Models and Machine Learning for Real-Time Processing
- Authors: Lilu Cheng, Jingjun Lu, Yi Xuan Chan, Quoc Khai Nguyen, John Bi, Sean Ho
- Abstract summary: Claims documents are fundamental to healthcare and insurance operations, serving as the basis for reimbursement, auditing, and compliance. This paper presents a robust multi-stage pipeline that integrates the multilingual optical character recognition (OCR) engine PaddleOCR, a traditional Logistic Regression classifier, and a compact Vision-Language Model (VLM), Qwen 2.5-VL-7B. The proposed system achieves a document-type classification accuracy of over 95 percent and a field-level extraction accuracy of approximately 87 percent, while maintaining an average processing latency of under 2 seconds per document.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Claims documents are fundamental to healthcare and insurance operations, serving as the basis for reimbursement, auditing, and compliance. However, these documents are typically not born digital; they often exist as scanned PDFs or photographs captured under uncontrolled conditions. Consequently, they exhibit significant content heterogeneity, ranging from typed invoices to handwritten medical reports, as well as linguistic diversity. This challenge is exemplified by operations at Fullerton Health, which handles tens of millions of claims annually across nine markets, including Singapore, the Philippines, Indonesia, Malaysia, Mainland China, Hong Kong, Vietnam, Papua New Guinea, and Cambodia. Such variability, coupled with inconsistent image quality and diverse layouts, poses a significant obstacle to automated parsing and structured information extraction. This paper presents a robust multi-stage pipeline that integrates the multilingual optical character recognition (OCR) engine PaddleOCR, a traditional Logistic Regression classifier, and a compact Vision-Language Model (VLM), Qwen 2.5-VL-7B, to achieve efficient and accurate field extraction from large-scale claims data. The proposed system achieves a document-type classification accuracy of over 95 percent and a field-level extraction accuracy of approximately 87 percent, while maintaining an average processing latency of under 2 seconds per document. Compared to manual processing, which typically requires around 10 minutes per claim, our system delivers a 300x improvement in efficiency. These results demonstrate that combining traditional machine learning models with modern VLMs enables production-grade accuracy and speed for real-world automation. The solution has been successfully deployed in our mobile application and is currently processing tens of thousands of claims weekly from Vietnam and Singapore.
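The abstract describes a three-stage flow: OCR text extraction, a lightweight document-type classifier, then VLM-based field extraction. The sketch below illustrates that routing pattern only; the OCR and VLM stages are stubbed (the paper uses PaddleOCR and Qwen 2.5-VL-7B), and the toy training data, function names, and prompt are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the three-stage claim pipeline: OCR -> Logistic Regression
# document-type classifier -> VLM field extraction. OCR and VLM calls
# are stubbed; training data and prompts are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def run_ocr(image_path: str) -> str:
    """Stage 1: OCR. The paper uses PaddleOCR; stubbed here."""
    return "INVOICE Clinic ABC Total: $120.00 Date: 2025-01-05"


# Stage 2: a lightweight document-type classifier over OCR text.
# A tiny hypothetical training set stands in for real labeled claims.
train_texts = [
    "invoice total amount due payment",
    "invoice clinic total date bill",
    "medical report diagnosis patient symptoms",
    "patient medical history examination report",
]
train_labels = ["invoice", "invoice", "medical_report", "medical_report"]
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_texts, train_labels)


def extract_fields(doc_type: str, ocr_text: str) -> dict:
    """Stage 3: prompt a compact VLM (Qwen 2.5-VL-7B in the paper) for
    structured fields. Stubbed: returns the prompt it would send."""
    prompt = (f"Document type: {doc_type}. Extract the key fields "
              f"as JSON from the following text:\n{ocr_text}")
    return {"doc_type": doc_type, "prompt": prompt}


def process_claim(image_path: str) -> dict:
    text = run_ocr(image_path)
    doc_type = classifier.predict([text.lower()])[0]
    return extract_fields(doc_type, text)


result = process_claim("claim_scan.jpg")
print(result["doc_type"])
```

The design point worth noting is the routing: the cheap classifier decides the document type first, so the comparatively expensive VLM call can be given a type-specific prompt (or skipped for unsupported types), which is what keeps per-document latency low.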
Related papers
- MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding [7.650139800950797]
MosaicDoc is a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of Visually Rich Document Understanding (VRDU). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our evaluation of state-of-the-art models on this benchmark reveals their current limitations in handling real-world document complexity.
arXiv Detail & Related papers (2025-11-13T03:34:44Z) - Multi-Stage Field Extraction of Financial Documents with OCR and Compact Vision-Language Models [2.6300820904868263]
Financial documents are essential sources of information for regulators, auditors, and financial institutions. These documents tend to be heterogeneous, mixing narratives, tables, figures, and multilingual content within the same report. We propose a multi-stage pipeline that leverages traditional image processing models and OCR extraction, together with compact VLMs for structured field extraction.
arXiv Detail & Related papers (2025-10-27T06:56:08Z) - UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases. UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG, built from 70k real-world PDF pages. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
arXiv Detail & Related papers (2025-10-04T04:30:13Z) - Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu [0.0]
Manchu, a critically endangered language, lacks effective OCR systems that can handle real-world historical documents. This study develops high-performing OCR systems by fine-tuning three open-source vision-language models. LLaMA-3.2-11B achieved exceptional performance with 98.3% word accuracy and a 0.0024 character error rate on synthetic data.
arXiv Detail & Related papers (2025-07-09T11:38:20Z) - WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild? [64.62909376834601]
This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. Evaluation of state-of-the-art MLLMs on WildDoc exposes substantial performance declines and underscores the models' inadequate robustness compared to traditional benchmarks.
arXiv Detail & Related papers (2025-05-16T09:09:46Z) - A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports [0.3552186988607578]
This study presents an open-source pipeline that extracts and categorizes checkbox data from scanned documents. The pipeline achieves high precision and recall compared against gold standards compiled annually from 2017 to 2024.
arXiv Detail & Related papers (2025-04-28T19:40:28Z) - Harnessing PDF Data for Improving Japanese Large Multimodal Models [56.80385809059738]
Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs.
arXiv Detail & Related papers (2025-02-20T17:59:59Z) - DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection [71.97939405401961]
We introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a large-scale multilingual corpus constructed from Common Crawl data and existing multilingual sources. DCAD-2000 covers 2,282 languages, 46.72TB of text, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. By fine-tuning LLMs on DCAD-2000, we demonstrate notable improvements in data quality, robustness of the cleaning pipeline, and downstream performance.
arXiv Detail & Related papers (2025-02-17T08:28:29Z) - PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Multimodal document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved performance on this task. We introduce PDF-WuKong, a multimodal large language model (MLLM) designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z) - Privacy Adhering Machine Un-learning in NLP [66.17039929803933]
In real-world industry settings, Machine Learning models are built on user data. Privacy mandates that allow users to request removal of their data require effort both in terms of data deletion and model retraining, and continuously removing data and retraining the model for each request does not scale. We propose Machine Unlearning to tackle this challenge.
arXiv Detail & Related papers (2022-12-19T16:06:45Z) - Families In Wild Multimedia: A Multimodal Database for Recognizing Kinship [63.27052967981546]
We introduce the first publicly available multi-task MM kinship dataset.
To build FIW-MM, we developed machinery to automatically collect, annotate, and prepare the data.
Results highlight edge cases to inspire future research with different areas of improvement.
arXiv Detail & Related papers (2020-07-28T22:36:57Z)