Related papers: Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning

Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning

URL: http://arxiv.org/abs/2505.20272v2
Date: Sun, 29 Jun 2025 16:04:49 GMT
Title: Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning
Authors: Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, Xiaodan Liang,
Abstract summary: Ground-R1 is a reinforcement learning framework that enables grounded visual reasoning without requiring explicit evidence or rationale annotations.<n>Ground-R1 achieves superior performance and exhibits emergent cognitive behaviors such as uncertainty awareness, spatial perception, and iterative refinement.
Score: 96.01617809845396
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive general capabilities across a wide range of multi-modal tasks. However, the reasoning processes of LVLMs often suffer from unreliable outputs and limited interpretability. To address this, grounded visual reasoning has emerged as a promising paradigm that enforces responses anchored on salient visual evidence regions. However, existing approaches typically rely on costly supervision such as bounding box annotations, chain-of-thought rationale or external tool calls, limiting their scalability. In this work, we propose Ground-R1, a reinforcement learning framework that enables grounded visual reasoning without requiring explicit evidence or rationale annotations. Ground-R1 consists of a grounding phase that generates evidence region rollouts based on format constraints, and an answering phase that produces responses guided by both answer correctness and format adherence rewards. Extensive experiments across multiple visual reasoning benchmarks manifest that Ground-R1 achieves superior performance and exhibits emergent cognitive behaviors such as uncertainty awareness, spatial perception, and iterative refinement, offering a scalable and interpretable alternative to existing approaches.

Related papers

CrystaL: Spontaneous Emergence of Visual Latents in MLLMs [55.34169914483764]
We propose CrystaL (Crystallized Latent Reasoning), a single-stage framework with two paths to process intact and corrupted images.<n>By explicitly aligning the attention patterns and prediction distributions across the two paths, CrystaL crystallizes latent representations into task-relevant visual semantics.<n>Experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-24T15:01:30Z)
MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions [42.03378622674476]
We propose the MIRROR framework for Multimodal Iterative Reasoning via Reflection On visual Regions.<n>By embedding visual reflection as a core mechanism, MIRROR is formulated as a closed-loop process comprising draft, critique, region-based verification, and revision.<n>Experiments on both general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucinations.
arXiv Detail & Related papers (2026-02-21T07:56:59Z)
Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models [72.4149653187766]
We propose a Reasoner-Verifier framework named Adrialversa Reasoning RAG (ARR)<n>The Reasoner and Verifier engage in reasoning on retrieved evidence and critiquing each other's logic while being guided by process-aware advantage.<n> Experiments on multiple benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2026-01-08T06:57:03Z)
BARE: Towards Bias-Aware and Reasoning-Enhanced One-Tower Visual Grounding [30.694164546429928]
We propose BARE, a bias-aware and reasoning-enhanced framework for one-tower visual grounding.<n>We show that BARE achieves state-of-the-art performance and delivers superior computational efficiency compared to existing approaches.
arXiv Detail & Related papers (2026-01-04T13:30:06Z)
A Survey on Latent Reasoning [100.54120559169735]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities.<n>CoT reasoning that verbalizes intermediate steps limits the model's expressive bandwidth.<n>Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state.
arXiv Detail & Related papers (2025-07-08T17:29:07Z)
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space.<n>Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z)
Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning [78.17782197231325]
We propose a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective.<n> Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance.
arXiv Detail & Related papers (2025-06-05T02:28:07Z)
The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning [39.613595533503144]
Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models.<n>We show that CoT consistently underperforms direct answering across varying model scales and benchmark complexities.<n>Our analysis uncovers a fundamental explicit-implicit duality driving CoT's performance in pattern-based ICL.
arXiv Detail & Related papers (2025-04-07T13:51:06Z)
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding.<n>It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions.<n>Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT)<n>Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs)<n>We investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education.<n>We utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications.
arXiv Detail & Related papers (2025-03-31T08:22:49Z)
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [26.757458496178437]
We introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning.<n>We construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains.<n>We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning.<n> Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL.
arXiv Detail & Related papers (2025-03-13T17:56:05Z)
Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities.<n>Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z)
Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models [36.119299938503936]
Large vision-language models (LVLMs) have shown promising performance on a variety of vision-language tasks. They remain susceptible to hallucinations, generating outputs misaligned with visual content or instructions. We propose reflective instruction tuning, which integrates rationale learning into visual instruction tuning.
arXiv Detail & Related papers (2024-07-16T06:32:45Z)
Learning Interpretable Policies in Hindsight-Observable POMDPs through Partially Supervised Reinforcement Learning [57.67629402360924]
We introduce the Partially Supervised Reinforcement Learning (PSRL) framework. At the heart of PSRL is the fusion of both supervised and unsupervised learning. We show that PSRL offers a potent balance, enhancing model interpretability while preserving, and often significantly outperforming, the performance benchmarks set by traditional methods.
arXiv Detail & Related papers (2024-02-14T16:23:23Z)
Interpretable Visual Question Answering via Reasoning Supervision [4.76359068115052]
Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task. We propose a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal. We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model's visual perception capability and lead to performance increase.
arXiv Detail & Related papers (2023-09-07T14:12:31Z)
Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision. Existing literature addresses this challenge by employing local-based representation approaches. This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.