Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis
- URL: http://arxiv.org/abs/2511.21397v1
- Date: Wed, 26 Nov 2025 13:49:08 GMT
- Title: Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis
- Authors: Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee
- Abstract summary: We investigate how visual distractors affect test-time scaling in vision-language models. We find that visual distractors differ fundamentally from textual ones. We propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
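The abstract's core analysis tool is tracking attribute counts within reasoning traces and relating them to reasoning length. A minimal sketch of that idea follows; the attribute vocabulary, the `semantic`/`numerical`/`spatial` groupings, and the helper names are illustrative assumptions, not taken from the paper's actual Idis dataset:

```python
import re

# Hypothetical attribute vocabulary grouped along the paper's three
# distractor dimensions; the real Idis attribute sets are not shown here.
ATTRIBUTES = {
    "semantic": ["cat", "dog", "bird"],
    "numerical": ["one", "two", "three"],
    "spatial": ["left", "right", "above", "below"],
}

def attribute_counts(trace: str) -> dict:
    """Count how often each attribute family is mentioned in a reasoning trace."""
    words = re.findall(r"[a-z]+", trace.lower())
    return {
        family: sum(words.count(term) for term in terms)
        for family, terms in ATTRIBUTES.items()
    }

def reasoning_length(trace: str) -> int:
    """Crude proxy for test-time compute: whitespace-delimited token count."""
    return len(re.findall(r"\S+", trace))

trace = "The dog is to the left of the cat; there are two animals."
print(attribute_counts(trace))  # {'semantic': 2, 'numerical': 1, 'spatial': 1}
print(reasoning_length(trace))  # 13
```

Comparing these per-trace statistics between distractor-free and distractor-laden images would let one check the paper's finding that visual distractors reduce accuracy without inflating reasoning length.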
Related papers
- VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models [21.438802784706994]
We propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process by re-injecting a coreset of visual tokens. Under fixed test-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.
arXiv Detail & Related papers (2026-02-27T11:48:19Z)
- Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis [61.597286699809395]
We introduce Temporal Attention Pattern Predictability Analysis (TAPPA). TAPPA characterizes attention patterns as predictable patterns with clear regularities and unpredictable patterns that appear effectively random. We provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE).
arXiv Detail & Related papers (2026-01-29T13:40:23Z)
- Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
Reasoning over dynamic visual content remains a central challenge for large language models. We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
arXiv Detail & Related papers (2025-11-28T18:59:58Z)
- BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception [67.89135437537179]
We introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone. Compared to prior perception benchmarks, it moves beyond shallow perception and requires fine-grained observation and analytical reasoning.
arXiv Detail & Related papers (2025-10-10T13:14:13Z)
- Unleashing Perception-Time Scaling to Multimodal Reasoning Models [60.578179197783754]
Recent advances in inference-time scaling have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. We propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems.
arXiv Detail & Related papers (2025-10-10T03:17:52Z)
- Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection [49.26064449816502]
We propose a Gradient-based Influence-Aware Constrained Decoding (GACD) method to address text-visual bias and co-occurrence bias. GACD effectively reduces hallucinations and improves the visual grounding of MLLM outputs.
arXiv Detail & Related papers (2025-09-03T08:13:52Z)
- More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models [43.465268635499754]
Test-time compute has empowered large language models to generate extended reasoning chains. As generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors.
arXiv Detail & Related papers (2025-05-23T05:08:40Z)
- Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models [58.64449765678416]
We introduce landscape of thoughts (LoT) to inspect reasoning trajectories produced by various reasoning methods on any multiple-choice dataset. LoT distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. We showcase this advantage by adapting LoT into a lightweight verifier that evaluates the correctness of trajectories.
arXiv Detail & Related papers (2025-03-28T06:09:51Z)
- Interpretable Visual Question Answering via Reasoning Supervision [4.76359068115052]
Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task.
We propose a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal.
We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model's visual perception capability and lead to a performance increase.
arXiv Detail & Related papers (2023-09-07T14:12:31Z)
- Perceptual Score: What Data Modalities Does Your Model Perceive? [73.75255606437808]
We introduce the perceptual score, a metric that assesses the degree to which a model relies on the different subsets of the input features.
We find that recent, more accurate multi-modal models for visual question-answering tend to perceive the visual data less than their predecessors.
Using the perceptual score also helps to analyze model biases by decomposing the score into data subset contributions.
arXiv Detail & Related papers (2021-10-27T12:19:56Z)
- Object-Centric Diagnosis of Visual Reasoning [118.36750454795428]
This paper presents a systematic object-centric diagnosis of visual reasoning in terms of grounding and robustness.
We develop a diagnostic model, namely Graph Reasoning Machine.
Our model replaces the purely symbolic visual representation with a probabilistic scene graph and then applies teacher-forcing training to the visual reasoning module.
arXiv Detail & Related papers (2020-12-21T18:59:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.