Spotlight on Token Perception for Multimodal Reinforcement Learning
- URL: http://arxiv.org/abs/2510.09285v1
- Date: Fri, 10 Oct 2025 11:25:33 GMT
- Title: Spotlight on Token Perception for Multimodal Reinforcement Learning
- Authors: Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, Yu Cheng
- Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs). In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception. We propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal.
- Score: 65.97597482517425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.
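The dual mechanism described in the abstract, reweighting a trajectory's advantage by its overall visual dependency and masking updates to the most perceptually pivotal tokens, can be sketched as a small weighting function. This is a hypothetical illustration based only on the abstract, not the authors' released code; `visual_dependency` (a per-token score) and `top_frac` (the fraction of tokens kept) are assumed names.

```python
import numpy as np

def vppo_token_weights(visual_dependency, advantage, top_frac=0.2):
    """Sketch of VPPO's dual mechanism (hypothetical helper, not the
    authors' implementation).

    visual_dependency: per-token visual-dependency scores for one rollout.
    advantage: the trajectory's scalar advantage from the verifiable reward.
    top_frac: assumed fraction of tokens treated as perceptually pivotal.
    Returns a per-token weight for the policy-gradient update.
    """
    dep = np.asarray(visual_dependency, dtype=float)
    # 1) Trajectory-level reweighting: scale the advantage by the
    #    trajectory's overall (mean) visual dependency.
    reweighted_adv = advantage * dep.mean()
    # 2) Token-level focusing: update only the top-`top_frac` tokens
    #    by visual dependency; all other tokens get zero weight.
    k = max(1, int(np.ceil(top_frac * dep.size)))
    threshold = np.sort(dep)[-k]
    mask = (dep >= threshold).astype(float)
    return reweighted_adv * mask
```

Under this sketch, a trajectory whose tokens rarely attend to the image contributes a small gradient overall, and within any trajectory only the visually grounded tokens receive updates.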
Related papers
- AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models [8.749398216116626]
We conduct a thorough empirical analysis using effective rank (erank) as a measure of feature diversity, together with attention score entropy, to investigate visual token processing mechanisms. Our analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended. We show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance.
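The two measures named in this summary have standard definitions that can be stated concretely. The sketch below assumes the usual formulations (erank as the exponentiated entropy of the normalized singular-value spectrum; attention entropy as the Shannon entropy of a normalized attention distribution); the paper's exact variants may differ.

```python
import numpy as np

def effective_rank(features):
    """Effective rank (erank) of a token-feature matrix: the
    exponentiated entropy of its normalized singular values.
    Standard definition, assumed to match the paper's usage."""
    s = np.linalg.svd(np.asarray(features, dtype=float), compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def attention_entropy(attn):
    """Shannon entropy of an attention distribution over visual tokens:
    high entropy means attention is spread evenly, low entropy means it
    concentrates on few tokens."""
    a = np.asarray(attn, dtype=float)
    a = a / a.sum()
    a = a[a > 0]
    return float(-(a * np.log(a)).sum())
```

For an orthonormal feature matrix the erank equals the matrix dimension, while redundant (near-collinear) token features drive it toward 1, which is how it serves as a diversity measure.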
arXiv Detail & Related papers (2026-03-01T19:14:39Z)
- Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization [56.083511902353365]
Reinforcement learning (RL) typically assigns uniform credit across an entire generation produced by a large language model. This work positions attention as a privileged substrate that renders the internal logic of LLMs as a mechanistic blueprint of reasoning itself. We introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes.
arXiv Detail & Related papers (2025-10-15T13:49:51Z)
- Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models [33.78309915588303]
Reinforcement learning (RL) has proven highly effective in eliciting the reasoning capabilities of large language models (LLMs). We propose a two-stage reinforcement learning framework designed to jointly enhance both the perceptual and reasoning capabilities of vision-language models (VLMs). After the proposed two-stage reinforcement learning process, we obtain PeBR-R1, a vision-language model with significantly enhanced perceptual and reasoning capabilities.
arXiv Detail & Related papers (2025-09-16T12:51:11Z)
- Perception-Aware Policy Optimization for Multimodal Reasoning [79.56070395437898]
A major source of error in current multimodal reasoning lies in the perception of visual inputs. We propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. We observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO.
arXiv Detail & Related papers (2025-07-08T23:22:34Z)
- PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning [50.21619363035618]
We propose PeRL, a general reinforcement learning approach tailored for interleaved multimodal tasks. We introduce permutations of image sequences to simulate varied positional relationships and to explore greater spatial and positional diversity. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin.
arXiv Detail & Related papers (2025-06-17T18:25:56Z)
- Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward [77.34936657745578]
We propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive visual content accurately. We show that Perception-R1 achieves state-of-the-art performance on most benchmarks using only 1,442 training samples.
arXiv Detail & Related papers (2025-06-08T16:48:42Z)
- VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning [45.39372905700317]
We introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information. With this framework, VLMs interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories. Our approach highlights key limitations of RL in RAG domains.
arXiv Detail & Related papers (2025-05-28T06:30:51Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder.
Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
arXiv Detail & Related papers (2024-11-21T14:22:38Z)
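The training-free pruning step FoPru describes, dropping visual tokens with low attention-based significance, reduces to a top-k selection over per-token scores. The sketch below is an assumed interface for illustration, not FoPru's actual code; `significance` stands in for whatever attention-derived score the vision encoder provides, and `keep_ratio` is a hypothetical hyperparameter.

```python
import numpy as np

def focal_prune(tokens, significance, keep_ratio=0.25):
    """Training-free pruning sketch (assumed interface, not FoPru's
    released code): keep the `keep_ratio` fraction of visual tokens
    with the highest attention-based significance, preserving their
    original order so positional structure survives."""
    sig = np.asarray(significance, dtype=float)
    k = max(1, int(round(keep_ratio * sig.size)))
    keep = np.sort(np.argsort(-sig)[:k])  # top-k indices, original order
    return [tokens[i] for i in keep]
```

Because no weights are updated, such a method can be dropped into an existing LVLM at inference time, trading a small accuracy cost for a shorter visual token sequence.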
This list is automatically generated from the titles and abstracts of the papers in this site.