Grounded Reinforcement Learning for Visual Reasoning
- URL: http://arxiv.org/abs/2505.23678v2
- Date: Mon, 20 Oct 2025 14:54:22 GMT
- Title: Grounded Reinforcement Learning for Visual Reasoning
- Authors: Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki
- Abstract summary: We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with reinforcement learning. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.
- Score: 51.94871616778874
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks--including SAT-2 and BLINK for spatial reasoning, V*bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding--ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL's performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model's visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.
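The abstract describes two core mechanics: anchoring each reasoning step to explicit image coordinates, and a multi-turn loop that zooms into a predicted coordinate when fine-grained exploration is needed. The sketch below is a minimal illustration of how such a rollout could look; the trace format, the `model.generate_step` call, and the `crop_around` helper are assumptions made for clarity, not the authors' released API.

```python
# Illustrative sketch of a ViGoRL-style multi-turn grounded reasoning rollout.
# All names below (GroundedStep, model.generate_step, crop_around) are hypothetical.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GroundedStep:
    thought: str            # natural-language reasoning for this step
    point: Tuple[int, int]  # (x, y) image coordinate the step is anchored to
    zoom: bool              # whether the policy requests a zoomed-in crop


def crop_around(image, point: Tuple[int, int], size: int = 224):
    """Return a crop centred on the predicted coordinate (assumes a PIL.Image)."""
    x, y = point
    left, top = max(0, x - size // 2), max(0, y - size // 2)
    return image.crop((left, top, left + size, top + size))


def grounded_rollout(model, image, question: str, max_turns: int = 4) -> List[GroundedStep]:
    """Roll out a spatially grounded reasoning trace, zooming in when requested."""
    trace: List[GroundedStep] = []
    context = [image]
    for _ in range(max_turns):
        step = model.generate_step(context, question, trace)  # hypothetical VLM call
        trace.append(step)
        if not step.zoom:
            break
        # Multi-turn behavior: feed the zoomed crop back so the next step can reason
        # over fine-grained evidence (e.g., small GUI elements in ScreenSpot).
        context.append(crop_around(image, step.point))
    return trace
```

During RL training, rollouts of this kind would be scored so that grounded traces leading to correct answers are reinforced; the exact reward design is given in the paper and is not reproduced here.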
Related papers
- When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning [108.73849507002195]
We present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency.
arXiv Detail & Related papers (2026-02-09T03:21:48Z) - From Illusion to Intention: Visual Rationale Learning for Vision-Language Reasoning [19.84653798433995]
We propose Visual Rationale Learning (ViRL), an end-to-end paradigm that grounds training in the visual rationale itself. ViRL integrates (1) Process Supervision with ground-truth rationales, (2) Objective Alignment via step-level reward shaping, and (3) Fine-Grained Credit Assignment to distinguish correct, redundant, and erroneous actions. This work establishes visual rationalization as a task-agnostic, process-grounded paradigm for building transparent, verifiable, and trustworthy vision-language models.
arXiv Detail & Related papers (2025-11-28T09:52:56Z) - BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception [67.89135437537179]
We introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone. Compared to prior perception benchmarks, it moves beyond shallow perception and requires fine-grained observation and analytical reasoning.
arXiv Detail & Related papers (2025-10-10T13:14:13Z) - More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models [17.431298099935344]
Reasoning has emerged as a pivotal capability in Large Language Models (LLMs), and recent research has sought to extend it to Vision-Language Models (VLMs). Our study uncovers the dual nature of multimodal reasoning: extended reasoning can cause recognition failures on otherwise basic visual questions. We propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories.
arXiv Detail & Related papers (2025-09-30T06:37:47Z) - Visual Jigsaw Post-Training Improves MLLMs [58.29961336087896]
We introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in multimodal large language models (MLLMs). Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding.
arXiv Detail & Related papers (2025-09-29T17:59:57Z) - Reinforced Visual Perception with Tools [66.79840157663237]
We introduce a novel RL algorithm based on GRPO, designed to train models to reason with a suite of four visual tools. We show that our method achieves state-of-the-art performance on several perception-heavy benchmarks. Our ReVPT-3B and ReVPT-7B outperform the instruct models by 9.03% and 9.44% on CV-Bench.
arXiv Detail & Related papers (2025-09-01T17:57:49Z) - ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z) - DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes [51.895756593200296]
Deep Inspection and Perception with RL (DIP-R1) is designed to enhance the visual perception capabilities of MLLMs. DIP-R1 guides MLLMs through detailed inspection of visual scenes via three simply designed rule-based rewards. It achieves consistent and significant improvements across various in-domain and out-of-domain scenarios.
arXiv Detail & Related papers (2025-05-29T07:16:16Z) - VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model [29.524164786422368]
Recently, DeepSeek R1 has shown that reinforcement learning can substantially improve the reasoning capabilities of Large Language Models (LLMs). We investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs). We develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks.
arXiv Detail & Related papers (2025-04-10T10:05:15Z) - Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - Sliding Puzzles Gym: A Scalable Benchmark for State Representation in Visual Reinforcement Learning [3.8309622155866583]
We introduce the Sliding Puzzles Gym (SPGym), a novel benchmark that transforms the classic 8-tile puzzle into a visual reinforcement learning task with images drawn from arbitrarily large datasets. SPGym's key innovation lies in its ability to precisely control representation learning complexity through adjustable grid sizes and image pools.
arXiv Detail & Related papers (2024-10-17T21:23:03Z) - ViSaRL: Visual Reinforcement Learning Guided by Human Saliency [6.969098096933547]
We introduce Visual Saliency-Guided Reinforcement Learning (ViSaRL).
Using ViSaRL to learn visual representations significantly improves the success rate, sample efficiency, and generalization of an RL agent.
We show that visual representations learned using ViSaRL are robust to various sources of visual perturbations including perceptual noise and scene variations.
arXiv Detail & Related papers (2024-03-16T14:52:26Z) - ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling [35.098725056881655]
Large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities.
The generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements.
We introduce a novel framework, ViGoR, that utilizes fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines.
arXiv Detail & Related papers (2024-02-09T01:00:14Z) - GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
We propose GeoVLN, which learns geometry-enhanced visual representations based on slot attention for robust Vision-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
arXiv Detail & Related papers (2023-05-26T17:15:22Z) - Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z) - INFOrmation Prioritization through EmPOWERment in Visual Model-Based RL [90.06845886194235]
We propose a modified objective for model-based reinforcement learning (RL).
We integrate a term inspired by variational empowerment into a state-space model based on mutual information.
We evaluate the approach on a suite of vision-based robot control tasks with natural video backgrounds.
arXiv Detail & Related papers (2022-04-18T23:09:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.