Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models
- URL: http://arxiv.org/abs/2511.11410v1
- Date: Fri, 14 Nov 2025 15:41:17 GMT
- Title: Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models
- Authors: Jiaxi Huang, Dongxu Wu, Hanwei Zhu, Lingyu Zhu, Jun Xing, Xu Wang, Baoliang Chen,
- Abstract summary: We propose Q-Doc to systematically probe the DIQA capabilities of MLLMs at coarse, middle, and fine granularity levels. We show that while MLLMs possess nascent DIQA abilities, they exhibit critical limitations: inconsistent scoring, distortion misidentification, and severity misjudgment. Our work provides a benchmark for DIQA capabilities in MLLMs, revealing pronounced deficiencies in their quality perception and promising pathways for enhancement.
- Score: 19.598563198222035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of Multi-modal Large Language Models (MLLMs) has expanded their capabilities beyond high-level vision tasks. Nevertheless, their potential for Document Image Quality Assessment (DIQA) remains underexplored. To bridge this gap, we propose Q-Doc, a three-tiered evaluation framework for systematically probing DIQA capabilities of MLLMs at coarse, middle, and fine granularity levels. a) At the coarse level, we instruct MLLMs to assign quality scores to document images and analyze their correlation with human quality annotations. b) At the middle level, we design distortion-type identification tasks, including single-choice and multi-choice tests for multi-distortion scenarios. c) At the fine level, we introduce distortion-severity assessment, where MLLMs classify distortion intensity against human-annotated references. Our evaluation demonstrates that while MLLMs possess nascent DIQA abilities, they exhibit critical limitations: inconsistent scoring, distortion misidentification, and severity misjudgment. Notably, we show that Chain-of-Thought (CoT) prompting substantially enhances performance across all levels. Our work provides a benchmark for DIQA capabilities in MLLMs, revealing pronounced deficiencies in their quality perception and promising pathways for enhancement. The benchmark and code are publicly available at: https://github.com/cydxf/Q-Doc.
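As an illustration of the coarse-level protocol, the minimal sketch below scores each document image with an MLLM and correlates the scores with human annotations via SRCC/PLCC, the standard agreement metrics in IQA. The prompt, the 1-5 scale, and `mllm_score` are hypothetical placeholders, not the paper's exact setup.

```python
# Minimal sketch of a coarse-level DIQA probe: ask an MLLM to rate each
# document image, then correlate its ratings with human annotations (MOS).
# `mllm_score` is a hypothetical stand-in for a real model call.
from scipy.stats import spearmanr, pearsonr

PROMPT = ("Rate the quality of this document image on a scale of "
          "1 (bad) to 5 (excellent). Answer with a single number.")

def mllm_score(image_path: str) -> float:
    """Hypothetical: send (PROMPT, image) to an MLLM and parse a 1-5 rating."""
    raise NotImplementedError

def coarse_level_eval(image_paths, mos):
    preds = [mllm_score(p) for p in image_paths]
    srcc, _ = spearmanr(preds, mos)  # monotonic agreement with annotations
    plcc, _ = pearsonr(preds, mos)   # linear agreement with annotations
    return srcc, plcc
```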
Related papers
- Decoupling Perception and Calibration: Label-Efficient Image Quality Assessment Framework [78.58395822978271]
LEAF is a Label-Efficient Image Quality Assessment Framework.
It distills perceptual quality priors from an MLLM teacher into a lightweight student regressor.
Our method significantly reduces the need for human annotations while maintaining strong MOS-aligned correlations.
arXiv Detail & Related papers (2026-01-28T15:15:17Z)
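A minimal sketch of the teacher-student distillation described above, assuming the frozen MLLM teacher produces pseudo-scores for unlabeled images and a small regressor is fit to them; the feature dimension, architecture, and loss are illustrative assumptions, not LEAF's published design.

```python
# Sketch of MLLM-to-regressor distillation: the teacher's pseudo-scores
# replace human labels, so only the (frozen) teacher sees unlabeled images.
# Feature extraction and the teacher call are hypothetical placeholders.
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def distill_step(features: torch.Tensor, teacher_scores: torch.Tensor) -> float:
    """features: (B, 768) image embeddings; teacher_scores: (B,) pseudo-MOS."""
    pred = student(features).squeeze(-1)
    loss = loss_fn(pred, teacher_scores)  # regress onto the teacher's prior
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```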
- Investigate the Low-level Visual Perception in Vision-Language based Image Quality Assessment [7.969076042774561]
We introduce a low-level distortion perception task that requires models to classify specific distortion types.
Our analysis shows that although MLLMs are structurally capable of representing such distortions, they tend to overfit training templates.
We show that improving the alignment of the vision encoder dramatically enhances distortion recognition accuracy, increasing it from 14.92% to 84.43%.
arXiv Detail & Related papers (2025-12-10T12:06:47Z)
- Revisiting MLLM Based Image Quality Assessment: Errors and Remedy [23.918454005000328]
A key challenge arises from the inherent mismatch between the discrete token outputs of MLLMs and the continuous nature of quality scores required by IQA tasks.
We propose Q-Scorer, which incorporates a lightweight regression module and IQA-specific score tokens into the MLLM pipeline.
Q-Scorer achieves state-of-the-art performance across multiple IQA benchmarks, generalizes well to mixed datasets, and further improves when combined with other methods.
arXiv Detail & Related papers (2025-11-11T04:08:44Z)
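A common remedy for this discrete-vs-continuous mismatch, and a hedged reading of what dedicated score tokens enable, is to take an expectation over quality-level token probabilities instead of decoding a single token; the five-level scale below is an assumption, not Q-Scorer's exact specification.

```python
# Sketch: turn discrete token logits into a continuous quality score by
# taking a probability-weighted average over quality-level tokens.
# `level_token_logits` would come from the MLLM's logits at the score
# position; the five-level scale is an assumption, not Q-Scorer's spec.
import torch

LEVEL_SCORES = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])  # bad ... excellent

def continuous_score(level_token_logits: torch.Tensor) -> torch.Tensor:
    """level_token_logits: (B, 5) logits for the five quality-level tokens."""
    probs = torch.softmax(level_token_logits, dim=-1)
    return probs @ LEVEL_SCORES  # expectation -> continuous score in [1, 5]
```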
- DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment [6.922942482129033]
We adapt DeQA-Score, a state-of-the-art MLLM-based image quality scorer, for document quality assessment.
We propose DeQA-Doc, a framework that leverages the visual language capabilities of MLLMs and a soft label strategy to regress continuous document quality scores.
arXiv Detail & Related papers (2025-07-17T05:23:53Z)
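The soft label strategy can be sketched as discretizing a continuous quality score into a distribution over level tokens, giving the MLLM a soft training target rather than one hard token; the Gaussian discretization and its width below are illustrative, not necessarily DeQA-Doc's exact scheme.

```python
# Sketch: convert a continuous MOS into a soft distribution over discrete
# quality levels, used as the training target for the level tokens.
# The Gaussian width `sigma` is an illustrative choice, not DeQA-Doc's value.
import numpy as np

LEVELS = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

def soft_label(mos: float, sigma: float = 0.5) -> np.ndarray:
    weights = np.exp(-0.5 * ((LEVELS - mos) / sigma) ** 2)
    return weights / weights.sum()  # e.g. mos=3.4 -> mass split over 3 and 4
```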
- Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage [50.84150600032693]
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations.
We propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions.
Our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V.
arXiv Detail & Related papers (2024-12-20T01:37:22Z)
- Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models [67.89204055004028]
Large Vision-Language Models (LVLMs) have been plagued by the issue of hallucination.
Previous works have proposed a series of benchmarks featuring different types of tasks and evaluation metrics.
We propose a Hallucination benchmark Quality Measurement framework (HQM) to assess the reliability and validity of existing hallucination benchmarks.
arXiv Detail & Related papers (2024-06-24T20:08:07Z)
- Q-Boost: On Visual Quality Assessment Ability of Low-level Multi-Modality Foundation Models [80.79438689784958]
We introduce Q-Boost, a strategy designed to enhance low-level MLLMs in image quality assessment (IQA) and video quality assessment (VQA) tasks.
Q-Boost innovates by incorporating a "middle ground" approach through neutral prompts, allowing for a more balanced and detailed assessment.
The experimental results show that low-level MLLMs exhibit outstanding zero-shot performance on IQA/VQA tasks when equipped with the Q-Boost strategy.
arXiv Detail & Related papers (2023-12-23T17:02:25Z)
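The "middle ground" idea can be sketched as extending the usual good/poor token probe with a neutral anchor and pooling the three probabilities into a single score; the token choices and anchor values below are illustrative assumptions, not Q-Boost's published settings.

```python
# Sketch of a Q-Boost-style probe: softmax over the logits of the tokens
# "good", "average" (the neutral middle ground) and "poor", then map the
# probabilities to anchor values. Anchors 1.0 / 0.5 / 0.0 are illustrative.
import torch

ANCHORS = torch.tensor([1.0, 0.5, 0.0])  # good, neutral, poor

def boosted_score(logits_good: torch.Tensor,
                  logits_neutral: torch.Tensor,
                  logits_poor: torch.Tensor) -> torch.Tensor:
    logits = torch.stack([logits_good, logits_neutral, logits_poor], dim=-1)
    probs = torch.softmax(logits, dim=-1)
    return probs @ ANCHORS  # probability-weighted quality in [0, 1]
```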
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision [85.6008224440157]
Multi-modality Large Language Models (MLLMs) have catalyzed a shift in computer vision from specialized models to general-purpose foundation models.
We present Q-Bench, a holistic benchmark crafted to evaluate the potential abilities of MLLMs across three realms: low-level visual perception, low-level visual description, and overall visual quality assessment.
arXiv Detail & Related papers (2023-09-25T14:43:43Z)