V-FAT: Benchmarking Visual Fidelity Against Text-bias
- URL: http://arxiv.org/abs/2601.04897v1
- Date: Thu, 08 Jan 2026 12:50:14 GMT
- Title: V-FAT: Benchmarking Visual Fidelity Against Text-bias
- Authors: Ziteng Wang, Yujie He, Guanliang Li, Siqi Yang, Jiaqi Xiong, Songxiang Liu, et al.
- Abstract summary: We investigate the tension between visual perception and linguistic priors. We introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. Our evaluation of 12 frontier MLLMs reveals that while models excel on existing benchmarks, they experience significant visual collapse under high linguistic dominance.
- Score: 10.716447149075357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel on existing benchmarks, they experience significant visual collapse under high linguistic dominance.
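The abstract does not spell out how VRS is computed, so the sketch below is purely illustrative: it scores a model's answers against a blind text-only baseline, so that correct answers the text-only baseline also produces earn only partial credit. Every field name, level weight, and the discount factor are assumptions made for illustration, not the authors' definition.

```python
# Hypothetical sketch of a VRS-style metric. All fields, weights, and the
# "lucky guess" discount are assumptions; the real definition is in the paper.
from dataclasses import dataclass

@dataclass
class Instance:
    level: int                # 1 = atypical image, 2 = misleading text, 3 = both
    model_correct: bool       # answer matches the visual ground truth
    text_only_correct: bool   # a blind, text-only baseline also answers correctly

def visual_robustness_score(instances: list[Instance]) -> float:
    """Reward visually grounded answers; discount likely linguistic guesses."""
    level_weight = {1: 1.0, 2: 1.5, 3: 2.0}  # assumed: harder conflict weighs more
    lucky_discount = 0.5                      # assumed partial credit
    gained = total = 0.0
    for ins in instances:
        w = level_weight[ins.level]
        total += w
        if ins.model_correct:
            gained += w * (lucky_discount if ins.text_only_correct else 1.0)
    return gained / total if total else 0.0

if __name__ == "__main__":
    demo = [Instance(1, True, True), Instance(2, True, False), Instance(3, False, False)]
    print(f"VRS = {visual_robustness_score(demo):.3f}")
```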
Related papers
- Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning [79.95774256444956]
The lack of reasoning capabilities in Vision-Language Models has remained at the forefront of research discourse. We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5, and Molmo through the lens of theories from pragmatics.
arXiv Detail & Related papers (2026-02-26T18:54:06Z)
- Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy [59.44168425139687]
BayesVLA is a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify. Experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods.
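Read literally, this decomposition is a Bayes factorization of the action posterior; a minimal rendering, with notation assumed here rather than taken from the paper, is:

```latex
% Assumed notation: a = action, v = visual observation, \ell = language prompt.
\pi(a \mid v, \ell) \propto
  \underbrace{p(a \mid v)}_{\text{visual-action prior (seeing-to-act)}}
  \cdot
  \underbrace{p(\ell \mid a, v)}_{\text{language likelihood (prompt-to-specify)}}
```

Under this reading, the prior proposes actions supported by perception alone, and the likelihood reweights them toward those consistent with the instruction.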
arXiv Detail & Related papers (2025-12-12T01:59:23Z)
- Unveiling Intrinsic Text Bias in Multimodal Large Language Models through Attention Key-Space Analysis [19.111897718147656]
Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data. We propose that the bias originates from the model's internal architecture.
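To make the key-space claim concrete: if text-token keys align more strongly with typical queries (for instance, via larger norms), scaled dot-product attention will systematically route mass toward text. The toy probe below illustrates only that mechanism; it is not the paper's analysis pipeline, and the larger text-key norm is an assumption planted for the demo.

```python
# Toy demonstration: keys with larger norm tend to capture more softmax mass.
import torch

def modality_attention_mass(q, k_text, k_img):
    """Average softmax attention mass a batch of queries assigns per modality."""
    k = torch.cat([k_text, k_img], dim=0)              # (T+I, d)
    attn = (q @ k.T / k.shape[-1] ** 0.5).softmax(-1)  # scaled dot-product
    t = k_text.shape[0]
    return attn[:, :t].sum(-1).mean().item(), attn[:, t:].sum(-1).mean().item()

torch.manual_seed(0)
d = 64
q = torch.randn(100, d)
k_text = 1.5 * torch.randn(10, d)  # assumption: text keys have larger norm
k_img = torch.randn(10, d)
text_mass, img_mass = modality_attention_mass(q, k_text, k_img)
print(f"text: {text_mass:.2f}  image: {img_mass:.2f}")  # text typically dominates
```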
arXiv Detail & Related papers (2025-10-30T17:22:22Z)
- BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception [67.89135437537179]
We introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone. Compared to prior perception benchmarks, it moves beyond shallow perception and requires fine-grained observation and analytical reasoning.
arXiv Detail & Related papers (2025-10-10T13:14:13Z)
- Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection [49.26064449816502]
We propose a Gradient-based Influence-Aware Constrained Decoding (GACD) method to address text-visual bias and co-occurrence bias. GACD effectively reduces hallucinations and improves the visual grounding of MLLM outputs.
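The summary does not give GACD's procedure, so the toy below shows only the core ingredient it names: a gradient-based estimate of how much each input token influenced a next-token logit (here, grad-times-input saliency on a stand-in linear "decoder"). The model, pooling, and saliency choice are all placeholders, not the paper's method.

```python
# Grad-times-input saliency on a toy model: how much did text vs. image
# tokens influence the top next-token logit? (Illustrative stand-in only.)
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_text, n_img, vocab = 16, 4, 4, 32
embed = torch.randn(n_text + n_img, d, requires_grad=True)  # text tokens first
decoder = nn.Linear(d, vocab)                               # stand-in LM head

logits = decoder(embed.mean(dim=0))  # pooled context -> next-token logits
logits.max().backward()              # gradient of the top logit w.r.t. inputs

with torch.no_grad():
    influence = (embed.grad * embed).sum(dim=-1).abs()  # per-token saliency
    text_share = influence[:n_text].sum() / influence.sum()
print(f"text influence share: {text_share.item():.2f}")
```

A constrained decoder in this spirit could rescale or reject continuations whose visual influence share falls below a threshold, which is one plausible reading of "influence-aware constrained decoding".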
arXiv Detail & Related papers (2025-09-03T08:13:52Z)
- A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [58.32070787537946]
Chain-of-thought (CoT) reasoning enhances the performance of large language models. We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z)
- Words or Vision: Do Vision-Language Models Have Blind Faith in Text? [34.88114876390461]
Vision-Language Models (VLMs) excel in integrating visual and textual information for vision-centric tasks. We investigate VLMs' modality preferences when faced with visual data and varied textual inputs in vision-centered settings. We discover a "blind faith in text" phenomenon: VLMs disproportionately trust textual data over visual data when inconsistencies arise.
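A simple way to quantify such a modality preference is to keep only instances where the image and the accompanying text imply different answers, then count which side the model takes. The record schema below is assumed for illustration and is not the paper's protocol.

```python
# Minimal harness for a "blind faith in text" rate under an assumed schema.
from collections import Counter

def modality_preference(records):
    """Fraction of conflicting instances where the model sides with each modality."""
    tally = Counter()
    for r in records:
        if r["image_answer"] == r["text_answer"]:
            continue  # modalities agree; instance is uninformative
        if r["model_answer"] == r["text_answer"]:
            tally["text"] += 1
        elif r["model_answer"] == r["image_answer"]:
            tally["image"] += 1
        else:
            tally["other"] += 1
    n = sum(tally.values())
    return {k: v / n for k, v in tally.items()} if n else {}

print(modality_preference([
    {"image_answer": "cat", "text_answer": "dog", "model_answer": "dog"},
    {"image_answer": "red", "text_answer": "blue", "model_answer": "red"},
]))
```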
arXiv Detail & Related papers (2025-03-04T02:21:07Z)
- Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection [18.625071242029936]
Large Language Models (LLMs) have been shown to exhibit various biases and stereotypes in their generated content. This paper presents a systematic framework to investigate and compare explicit and implicit biases in LLMs.
arXiv Detail & Related papers (2025-01-04T14:08:52Z)
- Covert Bias: The Severity of Social Views' Unalignment in Language Models Towards Implicit and Explicit Opinion [0.40964539027092917]
We evaluate the severity of bias toward a view by testing a biased model on edge cases of excessive-bias scenarios.
Our findings reveal a discrepancy in how well LLMs identify implicit versus explicit opinions, with a general tendency of bias toward explicit opinions of opposing stances.
The direct, incautious responses of the unaligned models suggest that their decisiveness needs further refinement.
arXiv Detail & Related papers (2024-08-15T15:23:00Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.