DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment
- URL: http://arxiv.org/abs/2507.12796v1
- Date: Thu, 17 Jul 2025 05:23:53 GMT
- Title: DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment
- Authors: Junjie Gao, Runze Liu, Yingzhe Peng, Shujian Yang, Jin Zhang, Kai Yang, Zhiyuan You,
- Abstract summary: We adapt DeQA-Score, a state-of-the-art MLLM-based image quality scorer, for document quality assessment.<n>We propose DeQA-Doc, a framework that leverages the visual language capabilities of MLLMs and a soft label strategy to regress continuous document quality scores.
- Score: 6.922942482129033
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document quality assessment is critical for a wide range of applications including document digitization, OCR, and archival. However, existing approaches often struggle to provide accurate and robust quality scores, limiting their applicability in practical scenarios. With the rapid progress in Multi-modal Large Language Models (MLLMs), recent MLLM-based methods have achieved remarkable performance in image quality assessment. In this work, we extend this success to the document domain by adapting DeQA-Score, a state-of-the-art MLLM-based image quality scorer, for document quality assessment. We propose DeQA-Doc, a framework that leverages the visual language capabilities of MLLMs and a soft label strategy to regress continuous document quality scores. To adapt DeQA-Score to DeQA-Doc, we adopt two complementary solutions to construct soft labels without the variance information. Also, we relax the resolution constrains to support the large resolution of document images. Finally, we introduce ensemble methods to further enhance the performance. Extensive experiments demonstrate that DeQA-Doc significantly outperforms existing baselines, offering accurate and generalizable document quality assessment across diverse degradation types. Codes and model weights are available in https://github.com/Junjie-Gao19/DeQA-Doc.
Related papers
- Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models [19.598563198222035]
We propose Q-Doc to systematically probing DIQA capabilities of MLLMs at coarse, middle, and fine granularity levels.<n>We show that while MLLMs possess nascent DIQA abilities, they exhibit critical limitations: inconsistent scoring, distortion misidentification, and severity misjudgment.<n>Our work provides a benchmark for DIQA capabilities in MLLMs, revealing pronounced deficiencies in their quality perception and promising pathways for enhancement.
arXiv Detail & Related papers (2025-11-14T15:41:17Z) - UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases.<n>UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG built from 70k real-world PDF pages.<n>Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
arXiv Detail & Related papers (2025-10-04T04:30:13Z) - DocIQ: A Benchmark Dataset and Feature Fusion Network for Document Image Quality Assessment [78.21680156380705]
We introduce a subjective DIQA dataset DIQA-5000.<n>The DIQA-5000 dataset comprises 5,000 document images, generated by applying multiple document enhancement techniques to 500 real-world images with diverse distortions.<n>Each enhanced image was rated by 15 subjects across three rating dimensions: overall quality, sharpness, and color fidelity.<n>We propose a specialized no-reference DIQA model that exploits document layout features to maintain quality perception at reduced resolutions to lower computational cost.
arXiv Detail & Related papers (2025-09-21T10:01:43Z) - Docopilot: Improving Multimodal Models for Document-Level Understanding [87.60020625241178]
We present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents.<n>This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents.<n>Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG.
arXiv Detail & Related papers (2025-07-19T16:03:34Z) - Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering [42.468210353582755]
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents.<n>Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches.<n>We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains.
arXiv Detail & Related papers (2025-05-22T09:52:57Z) - WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild? [64.62909376834601]
This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments.<n> evaluation of state-of-the-art MLLMs on WildDoc expose substantial performance declines and underscore the models' inadequate robustness compared to traditional benchmarks.
arXiv Detail & Related papers (2025-05-16T09:09:46Z) - M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? [49.53982792497275]
We investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in the document.<n>Existing document understanding benchmarks often assess LVLMs using question-answer formats.<n>We introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench)<n>M-DocSum-Bench comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences.
arXiv Detail & Related papers (2025-03-27T07:28:32Z) - Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage [50.84150600032693]
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations.<n>We propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions.<n>Our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V.
arXiv Detail & Related papers (2024-12-20T01:37:22Z) - VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings.<n>We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z) - HiQA: A Hierarchical Contextual Augmentation RAG for Multi-Documents QA [13.000411428297813]
We present HiQA, an advanced multi-document question-answering (MDQA) framework that integrates cascading metadata into content and a multi-route retrieval mechanism.
We also release a benchmark called MasQA to evaluate and research in MDQA.
arXiv Detail & Related papers (2024-02-01T02:24:15Z) - A Multi-Document Coverage Reward for RELAXed Multi-Document
Summarization [11.02198476454955]
We propose fine-tuning an MDS baseline with a reward that balances a reference-based metric with coverage of the input documents.
Experimental results over the Multi-News and WCEP MDS datasets show significant improvements of up to +0.95 pp average ROUGE score and +3.17 pp METEOR score over the baseline.
arXiv Detail & Related papers (2022-03-06T07:33:01Z) - Recognition-free Question Answering on Handwritten Document Collections [3.0969191504482247]
We present a recognition-free Question Answering approach for handwritten documents.
Our approaches outperform the state-of-the-art recognition-free models on the challenging BenthamQA and HW-SQuAD datasets.
arXiv Detail & Related papers (2022-02-12T14:47:44Z) - DocScanner: Robust Document Image Rectification with Progressive
Learning [162.03694280524084]
This work presents DocScanner, a new deep network architecture for document image rectification.
DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture.
The iterative refinements make DocScanner converge to a robust and superior performance, and the lightweight recurrent architecture ensures the running efficiency.
arXiv Detail & Related papers (2021-10-28T09:15:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.