VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
- URL: http://arxiv.org/abs/2503.07523v2
- Date: Tue, 01 Apr 2025 04:23:59 GMT
- Title: VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
- Authors: Zhangquan Chen, Xufang Luo, Dongsheng Li
- Abstract summary: We propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. By treating intermediate focus selection as an internal decision optimized through trial-and-error, our method eliminates the need for costly region annotations. Our method consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs.
- Score: 22.907814548315468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual understanding is inherently intention-driven - humans selectively focus on different regions of a scene based on their goals. Recent advances in large multimodal models (LMMs) enable flexible expression of such intentions through natural language, allowing queries to guide visual reasoning processes. Frameworks like Visual Chain-of-Thought have demonstrated the benefit of incorporating explicit reasoning steps, where the model predicts a focus region before answering a query. However, existing approaches rely heavily on supervised training with annotated intermediate bounding boxes, which severely limits scalability due to the combinatorial explosion of intention-region pairs. To overcome this limitation, we propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. VisRL optimizes the entire visual reasoning process using only reward signals. By treating intermediate focus selection as an internal decision optimized through trial-and-error, our method eliminates the need for costly region annotations while aligning more closely with how humans learn to perceive the world. Extensive experiments across multiple benchmarks show that VisRL consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs. Our code is available at https://github.com/zhangquanchen/VisRL.
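To make the reward-only training idea concrete, here is a minimal REINFORCE-style sketch of optimizing an intermediate focus region purely from answer correctness. The model interface (`propose_region`, `answer`) and the 0/1 reward are illustrative assumptions, not VisRL's released implementation.

```python
# Minimal REINFORCE-style sketch of reward-only focus selection.
# The interfaces (propose_region, answer, the log-probs they return) are
# hypothetical stand-ins for an LMM's sampling APIs, not VisRL's actual code.
def train_step(model, optimizer, image, query, gold_answer):
    # Sample an intermediate focus region; no box annotation is used.
    region, region_logp = model.propose_region(image, query)
    answer, answer_logp = model.answer(image.crop(region), query)

    # Reward comes only from the final outcome, never from region labels.
    reward = 1.0 if answer == gold_answer else 0.0

    # Policy gradient: reinforce the whole trajectory (region + answer).
    loss = -(region_logp + answer_logp) * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```

The key point is that the bounding box never needs a label: it is reinforced only insofar as it leads to a correct final answer.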
Related papers
- Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs [55.61018839017648]
Chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks. Existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. We propose SAYO, a visual reasoning model trained with a reinforcement learning framework that introduces a region-level visual attention-based reward.
arXiv Detail & Related papers (2026-02-09T03:33:23Z)
- FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models [20.47311573790516]
We propose FRISM (Fine-grained Reasoning Injection via Subspace-level Model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Experiments demonstrate that FRISM effectively improves reasoning capabilities without compromising the model's original visual capabilities.
arXiv Detail & Related papers (2026-01-29T02:36:19Z)
- Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy [75.66913260900726]
Reinforcement Learning with Verifiable Rewards has significantly advanced reasoning capabilities in Large Language Models. Existing paradigms, driven by text-centric outcome rewards, encourage models to bypass visual perception. We propose Thinking with Deltas, a framework driven by a Differential Visual Reasoning Policy.
arXiv Detail & Related papers (2026-01-11T08:25:34Z)
- Latent Implicit Visual Reasoning [59.39913238320798]
We propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks.
arXiv Detail & Related papers (2025-12-24T14:59:49Z)
- CoFFT: Chain of Foresight-Focus Thought for Visual Language Models [61.34272727005052]
Chain of Foresight-Focus Thought (CoFFT) is a training-free approach that enhances visual reasoning by emulating human visual cognition. These stages function iteratively, creating an interdependent cycle in which reasoning guides visual focus and visual focus informs subsequent reasoning. Empirical results across multiple benchmarks using Qwen2.5-VL, InternVL-2.5, and Llava-Next demonstrate consistent performance improvements of 3.1-5.8% with a controllable increase in computational overhead.
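The iterative cycle described here is easy to picture as a loop. Below is a training-free sketch of that loop under stated assumptions: `predict_focus` and `answer` are hypothetical wrappers around a frozen VLM, and the fixed round count stands in for CoFFT's actual stage logic.

```python
# Sketch of an iterative foresight-focus loop (training-free), in the spirit
# of the cycle described above. The VLM wrappers are hypothetical, not
# CoFFT's actual interface.
def foresight_focus_answer(vlm, image, question, max_rounds=3):
    context = []          # accumulated reasoning steps
    view = image          # current visual focus (full image at first)
    for _ in range(max_rounds):
        # Reasoning over the current view proposes where to look next.
        thought, focus_box = vlm.predict_focus(view, question, context)
        context.append(thought)
        # The new focus region informs the next round of reasoning.
        view = image.crop(focus_box)
    # Final answer conditions on the full reasoning trace.
    return vlm.answer(image, question, context)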
arXiv Detail & Related papers (2025-09-26T07:46:30Z)
- ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models [11.263321053154364]
ERGO performs reasoning-driven perception, leveraging multimodal context to determine where to focus. We develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency.
arXiv Detail & Related papers (2025-09-26T07:15:19Z)
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z)
- Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning [96.01617809845396]
Ground-R1 is a reinforcement learning framework that enables grounded visual reasoning without requiring explicit evidence or rationale annotations. Ground-R1 achieves superior performance and exhibits emergent cognitive behaviors such as uncertainty awareness, spatial perception, and iterative refinement.
arXiv Detail & Related papers (2025-05-26T17:51:47Z)
- VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought [51.43082554363725]
We introduce VLM-R$^3$ (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to decide when additional visual evidence is needed. Experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R$^3$ sets a new state of the art.
arXiv Detail & Related papers (2025-05-22T03:50:13Z)
- ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. It utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error, and consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
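As a rough illustration of this trial-and-error loop, the sketch below trains a frame selector with the downstream model's answer correctness as the only reward. The `selector`/`answerer` interfaces and the Bernoulli sampling scheme are assumptions for illustration, not ViaRL's released code.

```python
# Sketch of trial-and-error frame selection rewarded by downstream accuracy.
import torch

def viarl_style_step(selector, answerer, optimizer, video_frames, question, gold):
    # The selector scores frames and samples a small subset (the "trial").
    logits = selector(video_frames, question)               # [num_frames]
    dist = torch.distributions.Bernoulli(logits=logits)
    mask = dist.sample()                                    # which frames to keep
    chosen = [f for f, m in zip(video_frames, mask.tolist()) if m > 0]

    # Rule-based reward: did the frozen downstream model answer correctly?
    with torch.no_grad():
        reward = 1.0 if answerer(chosen, question) == gold else 0.0

    # REINFORCE update on the selector only.
    loss = -dist.log_prob(mask).sum() * reward
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return reward
```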
arXiv Detail & Related papers (2025-05-21T12:29:40Z)
- Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs [62.9348974370985]
We propose attention reallocation (AttnReal) to mitigate hallucinations with nearly zero extra cost.
Our approach is motivated by the key observation that an MLLM's unreasonable attention distribution causes features to be dominated by historical output tokens.
Based on this observation, AttnReal recycles excessive attention from output tokens and reallocates it to visual tokens, which reduces the MLLM's reliance on language priors.
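The recycle-and-reallocate step lends itself to a compact sketch. The version below scales down post-softmax attention on output tokens and redistributes the recovered mass over visual tokens; the tensor layout and the single `ratio` knob are illustrative assumptions rather than AttnReal's exact formulation.

```python
# Sketch of attention reallocation: take a fraction of the attention mass
# spent on previously generated (output) tokens and hand it back to visual
# tokens, proportional to their current weights.
import torch

def reallocate_attention(attn, visual_idx, output_idx, ratio=0.5):
    """attn: [batch, heads, query_len, key_len] post-softmax weights."""
    attn = attn.clone()
    # Recycle a fraction of the attention currently on output tokens.
    recycled = attn[..., output_idx].sum(dim=-1, keepdim=True) * ratio
    attn[..., output_idx] *= (1.0 - ratio)
    # Redistribute it over visual tokens.
    vis = attn[..., visual_idx]
    attn[..., visual_idx] = vis + recycled * vis / vis.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return attn  # rows still sum to (approximately) 1
```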
arXiv Detail & Related papers (2025-03-11T11:52:37Z)
- Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models [85.51753014478315]
We introduce AdaptPrune, a novel plug-and-play training-free pruning method.
It builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach.
Our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions.
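A minimal sketch of such multi-cue pruning follows: tokens are visited in order of attention score, and a candidate is suppressed (NMS-style) when a kept token is both spatially close and highly similar in feature space. The thresholds and the greedy pass are illustrative assumptions, not AdaptPrune's tuned procedure.

```python
# Sketch of multi-cue token pruning with an NMS-style suppression pass.
import torch

def prune_tokens(feats, attn_score, positions, keep=64,
                 sim_thresh=0.9, dist_thresh=1.5):
    """feats: [N, d]; attn_score: [N]; positions: [N, 2] grid coords."""
    order = attn_score.argsort(descending=True)   # most-attended first
    kept = []
    for i in order.tolist():
        if len(kept) == keep:
            break
        redundant = False
        for j in kept:
            close = torch.dist(positions[i], positions[j]) < dist_thresh
            similar = torch.cosine_similarity(feats[i], feats[j], dim=0) > sim_thresh
            if close and similar:        # suppress spatially-near duplicates
                redundant = True
                break
        if not redundant:
            kept.append(i)
    return torch.tensor(kept)
```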
arXiv Detail & Related papers (2025-03-11T03:58:17Z)
- Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [52.478956204238315]
We study the spatial reasoning challenge through the lens of mechanistic interpretability. We observe that successful spatial reasoning correlates strongly with the model's ability to align its attention with actual object locations. Motivated by these findings, we propose ADAPTVIS to sharpen the attention on highly relevant regions when the model is confident.
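Confidence-conditioned attention sharpening can be summarized in a few lines. In the sketch below, a low softmax temperature concentrates attention when the model is confident and a high one flattens it otherwise; the two temperatures and the confidence proxy are illustrative assumptions.

```python
# Sketch of confidence-conditioned attention sharpening: sharpen when
# confident, smooth when uncertain.
import torch

def adaptive_attention(scores, confidence, conf_thresh=0.6,
                       sharp_t=0.5, smooth_t=2.0):
    """scores: [heads, q, k] pre-softmax logits; confidence: scalar in [0, 1]."""
    t = sharp_t if confidence > conf_thresh else smooth_t
    return torch.softmax(scores / t, dim=-1)
```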
arXiv Detail & Related papers (2025-03-03T17:57:03Z)
- Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization [19.37373012848517]
Large Vision Language Models (VLMs) are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies. We introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset. We also introduce rDPO, an extension of standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning.
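To make the rDPO idea concrete, the sketch below adds a visual preference margin to the standard DPO objective, scoring the same response against a preferred image versus a retrieved negative. The weighting `alpha` and the exact pairing are assumptions for illustration, not the paper's published loss.

```python
# Sketch of a DPO loss extended with a visual preference term.
import torch.nn.functional as F

def rdpo_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                    logp_img_w, logp_img_l, ref_logp_img_w, ref_logp_img_l,
                    beta=0.1, alpha=0.5):
    # Standard DPO margin on textual responses (chosen vs. rejected).
    text_margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # Additional margin on images: the same response scored against the
    # preferred image vs. a retrieved negative image.
    img_margin = (logp_img_w - ref_logp_img_w) - (logp_img_l - ref_logp_img_l)
    return (-F.logsigmoid(beta * text_margin)
            - alpha * F.logsigmoid(beta * img_margin)).mean()
```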
arXiv Detail & Related papers (2025-02-18T18:59:57Z)
- DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests [69.00444996464662]
We present DrivingVQA, a new benchmark derived from driving theory tests to evaluate visual chain-of-thought reasoning in complex real-world scenarios. Our experiments reveal that open-source and proprietary LVLMs struggle with visual chain-of-thought reasoning under zero-shot settings. We investigate training strategies that leverage relevant entities to improve visual reasoning.
arXiv Detail & Related papers (2025-01-08T18:31:16Z)
- Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction [80.67150791183126]
Pre-trained vision-language models (VLMs) have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. We propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods.
arXiv Detail & Related papers (2024-12-09T06:34:23Z)
- Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination [13.706325901731665]
Multimodal large language models (MLLMs) have advanced the integration of visual and linguistic modalities.
Current approaches like chain-of-thought (CoT) reasoning have augmented the cognitive capabilities of large language models (LLMs), but their adaptation to MLLMs is hindered by heightened risks of hallucination in cross-modality comprehension.
arXiv Detail & Related papers (2024-11-15T21:01:37Z)
- Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations [41.5875455113941]
We investigate whether advanced VLN models genuinely comprehend the visual content of their environments.
Surprisingly, we find experimentally that simple branch expansion, even with noisy visual inputs, improves navigational efficacy. We present a versatile Multi-Branch Architecture (MBA) designed to delve into the impact of both branch quantity and visual quality.
arXiv Detail & Related papers (2024-09-09T12:17:38Z)
- ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [73.34709921061928]
We propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs). We optimize a learnable latent variable based on an energy function, enhancing the strength of referring regions in the attention map. Our method offers a promising direction for integrating referring abilities into MLLMs, and supports referring with box, mask, scribble, and point.
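A minimal sketch of this latent optimization follows: with the model frozen, a small offset on the visual tokens is tuned so that attention mass inside the referred region grows. The energy definition and the `attention_over_visual` hook are hypothetical stand-ins for the method's actual attention-map machinery.

```python
# Sketch of training-free visual prompting by latent optimization: only the
# latent offset is updated; all model weights stay frozen.
import torch

def optimize_visual_prompt(mllm, vis_tokens, text_ids, region_mask,
                           steps=20, lr=0.01):
    latent = torch.zeros_like(vis_tokens, requires_grad=True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        # attn: [q, num_visual_tokens] cross-attention onto visual tokens
        attn = mllm.attention_over_visual(vis_tokens + latent, text_ids)
        # Energy: negative attention mass falling inside the referred region.
        energy = -(attn * region_mask).sum() / attn.sum().clamp_min(1e-8)
        opt.zero_grad(); energy.backward(); opt.step()
    return vis_tokens + latent.detach()
```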
arXiv Detail & Related papers (2024-07-31T11:40:29Z)
- Vision-and-Language Navigation via Causal Learning [13.221880074458227]
The cross-modal causal transformer (GOAT) is a pioneering solution rooted in the paradigm of causal inference.
BACL and FACL modules promote unbiased learning by comprehensively mitigating potential spurious correlations.
To capture global confounder features, we propose a cross-modal feature pooling module supervised by contrastive learning.
arXiv Detail & Related papers (2024-04-16T02:40:35Z)
- CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning [61.21923643289266]
Chain of Manipulations is a mechanism that enables Vision-Language Models to solve problems step-by-step with evidence. After training, models can solve various visual problems by actively eliciting intrinsic manipulations (e.g., grounding, zooming in) without involving external tools. Our trained model, CogCoM, achieves state-of-the-art performance across 9 benchmarks from 4 categories.
arXiv Detail & Related papers (2024-02-06T18:43:48Z)
- Language-Guided Diffusion Model for Visual Grounding [33.714789952452094]
Existing approaches complete such visual-text reasoning in a single-step manner. We propose a language-guided diffusion framework for visual grounding, LG-DVG, which trains the model to progressively reason about queried object boxes. Experiments on five widely used datasets validate the superior performance of solving visual grounding, a cross-modal alignment task, in a generative way.
arXiv Detail & Related papers (2023-08-18T14:54:13Z)
- Learning Common Rationale to Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems [61.11799513362704]
We propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes.
We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective.
arXiv Detail & Related papers (2023-03-03T02:07:40Z)
- A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.