Unbiased Visual Reasoning with Controlled Visual Inputs
- URL: http://arxiv.org/abs/2512.22183v1
- Date: Fri, 19 Dec 2025 18:52:06 GMT
- Title: Unbiased Visual Reasoning with Controlled Visual Inputs
- Authors: Zhaonan Li, Shijie Lu, Fei Wang, Jacob Dineen, Xiao Ye, Zhikun Xu, Siyi Liu, Young Min Cho, Bangzheng Li, Daniel Chang, Kenny Nguyen, Qizheng Yang, Muhao Chen, Ben Zhou
- Abstract summary: VISTA is a framework that decouples perception from reasoning via an explicit information bottleneck. A frozen VLM sensor is restricted to short, objective perception queries. A text-only LLM reasoner decomposes each question, plans queries, and aggregates visual facts in natural language.
- Score: 28.155426761798022
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end Vision-language Models (VLMs) often answer visual questions by exploiting spurious correlations instead of causal visual evidence, and can become more shortcut-prone when fine-tuned. We introduce VISTA (Visual-Information Separation for Text-based Analysis), a modular framework that decouples perception from reasoning via an explicit information bottleneck. A frozen VLM sensor is restricted to short, objective perception queries, while a text-only LLM reasoner decomposes each question, plans queries, and aggregates visual facts in natural language. This controlled interface defines a reward-aligned environment for training unbiased visual reasoning with reinforcement learning. Instantiated with Qwen2.5-VL and Llama3.2-Vision sensors, and trained with GRPO from only 641 curated multi-step questions, VISTA significantly improves robustness to real-world spurious correlations on SpuriVerse (+16.29% with Qwen-2.5-VL-7B and +6.77% with Llama-3.2-Vision-11B), while remaining competitive on MMVP and a balanced SeedBench subset. VISTA transfers robustly across unseen VLM sensors and is able to recognize and recover from VLM perception failures. Human analysis further shows that VISTA's reasoning traces are more neutral, less reliant on spurious attributes, and more explicitly grounded in visual evidence than end-to-end VLM baselines.
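The sensor/reasoner split described in the abstract can be sketched as a simple loop: a text-only reasoner plans short perception queries, a frozen VLM sensor answers each one in text, and the reasoner aggregates those facts. The sketch below is a hypothetical illustration of that information bottleneck only; the function names, prompts, and mock sensor are assumptions, and the paper's actual models and GRPO training are not reproduced.

```python
# Minimal sketch of a VISTA-style perception/reasoning split.
# All names here are illustrative; a real system would back the
# reasoner with a text-only LLM and the sensor with a frozen VLM.

from typing import Callable, List

def reasoner_plan(question: str) -> List[str]:
    """Decompose a question into short, objective perception queries (stubbed)."""
    # A real reasoner would prompt an LLM; here one decomposition is hard-coded.
    return ["What object is in the foreground?", "What color is it?"]

def reasoner_aggregate(question: str, facts: List[str]) -> str:
    """Combine textual perception facts into a final answer (stubbed)."""
    return f"Answer to '{question}' given facts: " + "; ".join(facts)

def run_vista(question: str, sensor: Callable[[str], str]) -> str:
    """The information bottleneck: the reasoner only ever sees sensor text."""
    queries = reasoner_plan(question)
    facts = [f"{q} -> {sensor(q)}" for q in queries]
    return reasoner_aggregate(question, facts)

# A mock frozen-VLM sensor backed by a lookup table, for illustration only.
mock_answers = {
    "What object is in the foreground?": "a red bicycle",
    "What color is it?": "red",
}

result = run_vista("Is the bicycle red?",
                   lambda q: mock_answers.get(q, "unknown"))
print(result)
```

Because the reasoner never sees pixels, only the sensor's short textual answers, spurious visual correlations can influence the final answer only through what the controlled queries surface.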
Related papers
- ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance [50.05984919728878]
We present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations. To evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop.
arXiv Detail & Related papers (2026-01-23T11:31:07Z) - More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models [74.10138874771852]
We propose PeRL-VL (Perception and Reasoning Learning for Vision-Language Models), a decoupled framework that separately improves visual perception and textual reasoning on top of RLVR. For perception, PeRL-VL introduces a VLM-based description reward that scores the model's self-generated image descriptions for faithfulness and sufficiency. For reasoning, PeRL-VL adds a text-only Reasoning SFT stage on logic-rich chain-of-thought data, enhancing coherence and logical consistency independently of vision.
arXiv Detail & Related papers (2025-12-13T23:06:18Z) - Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs [76.47326680870783]
We introduce VISTA-Gym, a training environment for incentivizing tool-integrated visual reasoning capabilities in vision-language models (VLMs). VISTA-Gym unifies diverse real-world multimodal reasoning tasks with a standardized interface for visual tools. We show that VISTA-R1-8B outperforms state-of-the-art baselines of similar size by 9.51%-18.72% on 11 public reasoning-intensive VQA benchmarks.
arXiv Detail & Related papers (2025-11-24T22:58:26Z) - Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens [54.18057944158818]
Chain-of-Visual-Thought (COVT) is a framework that enables Vision-Language Models (VLMs) to reason through continuous visual tokens. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts. During training, the VLM with COVT autoregressively predicts visual tokens to reconstruct dense supervision signals.
arXiv Detail & Related papers (2025-11-24T18:55:19Z) - MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs [22.99984702966184]
We introduce MVI-Bench, the first comprehensive benchmark for evaluating how misleading visual inputs undermine the robustness of Large Vision-Language Models (LVLMs). MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. MVI-Sensitivity is a novel metric that characterizes LVLM robustness at a granular level.
arXiv Detail & Related papers (2025-11-18T05:48:08Z) - CoRGI: Verified Chain-of-Thought Reasoning with Post-hoc Visual Grounding [1.6257248483123767]
We present CoRGI (Chain of Reasoning with Grounded Insights), a framework that enhances reasoning reliability through post-hoc verification of chain-of-thought outputs.
arXiv Detail & Related papers (2025-08-01T07:17:12Z) - Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities [54.94982467313341]
Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. We set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which components of design, specifically, may be lacking.
arXiv Detail & Related papers (2025-07-10T15:26:41Z) - Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs [9.406760867809124]
This paper introduces Visual Input Structure for Enhanced Reasoning (VISER). VISER is a simple, effective method that augments visual inputs with low-level spatial structures. We empirically demonstrate substantial performance improvements across core visual reasoning tasks.
arXiv Detail & Related papers (2025-06-27T11:44:40Z) - PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models [17.522361689805724]
Vision language models (VLMs) respond to user-crafted text prompts and visual inputs. It is crucial to determine whether VLMs inherit this instability to varying prompts. We introduce PARC (Prompt Analysis via Reliability and Calibration), a VLM prompt sensitivity analysis framework.
arXiv Detail & Related papers (2025-06-03T19:42:32Z) - Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis [6.704529554100875]
Vision-language models (VLMs) have shown impressive zero- and few-shot performance on real-world visual question answering benchmarks.
It remains unclear whether a VLM's apparent visual reasoning performance is due to its world knowledge, or due to actual visual reasoning capabilities.
arXiv Detail & Related papers (2024-08-27T14:43:54Z) - See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages, see, think and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.