Credit Where It is Due: Cross-Modality Connectivity Drives Precise Reinforcement Learning for MLLM Reasoning
- URL: http://arxiv.org/abs/2602.11455v1
- Date: Thu, 12 Feb 2026 00:20:54 GMT
- Title: Credit Where It is Due: Cross-Modality Connectivity Drives Precise Reinforcement Learning for MLLM Reasoning
- Authors: Zhengbo Jiao, Shaobo Wang, Zifan Zhang, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang
- Abstract summary: How visual evidence is integrated during reasoning remains poorly understood. We propose Anchor-Token Reinforcement Learning (AT-RL), a lightweight framework that selectively reinforces high-connectivity tokens. Our work reveals that reasoning quality is governed not by token quantity but by the fidelity of cross-modal anchoring.
- Score: 11.021067780524348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet how visual evidence is integrated during reasoning remains poorly understood. We explore multimodal RLVR through the lens of cross-modal attention connectivity and find that only a small fraction of tokens (approximately 15%) exhibit strong visual-textual coupling. These high-connectivity tokens act as anchors that ground reasoning in the image, while the majority follow linguistic patterns. During RLVR training, credit assignment naturally concentrates on these anchors, sharpening their visual grounding over time. Building on this insight, we propose Anchor-Token Reinforcement Learning (AT-RL), a lightweight framework that selectively reinforces high-connectivity tokens via graph-based clustering of attention topology. Evaluated across a model series spanning 3B to 32B parameters, AT-RL introduces only 1.2% overhead yet enables the 32B model to surpass the 72B-Instruct baseline on MathVista (80.2), with consistent gains observed across STEM, video, and general tasks. Conversely, training solely on low-connectivity tokens causes severe degradation, confirming that effective multimodal RL hinges on precise credit assignment to visual anchors. Our work reveals that reasoning quality is governed not by token quantity but by the fidelity of cross-modal anchoring.
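As a rough illustration of the idea in the abstract (not the authors' actual implementation), the Python sketch below scores each generated text token by its attention mass over image tokens, keeps roughly the top 15% as anchor tokens, and confines a policy-gradient update to those positions. All names (select_anchor_tokens, anchor_ratio, anchor_policy_gradient_loss) are hypothetical, and the paper's graph-based clustering of the attention topology is approximated here by a simple top-k threshold.

```python
# Minimal sketch of anchor-token credit assignment: score text tokens by their
# cross-modal attention mass, keep the top ~15% as anchors, and mask the
# policy-gradient loss so only anchor tokens receive the reinforcement signal.
import torch


def select_anchor_tokens(cross_attn: torch.Tensor, anchor_ratio: float = 0.15) -> torch.Tensor:
    """cross_attn: [num_text_tokens, num_image_tokens] attention weights.

    Returns a boolean mask marking high-connectivity (anchor) text tokens.
    """
    connectivity = cross_attn.sum(dim=-1)                 # visual attention mass per text token
    k = max(1, int(anchor_ratio * connectivity.numel()))  # keep roughly 15% of tokens as anchors
    threshold = connectivity.topk(k).values.min()
    return connectivity >= threshold


def anchor_policy_gradient_loss(logprobs: torch.Tensor,
                                advantages: torch.Tensor,
                                anchor_mask: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style loss where credit is assigned only to anchor tokens."""
    per_token = -(logprobs * advantages)
    return (per_token * anchor_mask.float()).sum() / anchor_mask.float().sum().clamp(min=1.0)


# Toy usage with random tensors standing in for model outputs.
if __name__ == "__main__":
    T, V = 32, 256                                  # number of text tokens, image tokens
    cross_attn = torch.rand(T, V).softmax(dim=-1)
    mask = select_anchor_tokens(cross_attn)
    logprobs = torch.randn(T, requires_grad=True)
    advantages = torch.full((T,), 0.7)              # verifiable-reward advantage, broadcast per token
    loss = anchor_policy_gradient_loss(logprobs, advantages, mask)
    loss.backward()
    print(f"anchors: {mask.sum().item()}/{T}, loss: {loss.item():.4f}")
```

The selective mask is the key design choice: tokens with weak visual coupling contribute nothing to the gradient, which mirrors the paper's claim that reinforcing only low-connectivity tokens degrades performance.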
Related papers
- From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning [72.4876727619987]
We find that reasoning performance is strongly correlated with the Visual Attention Score (VAS). To validate its causal role, we design training-free interventions that directly modulate attention allocation during inference. We propose Attention-Guided Visual Anchoring and Reflection, a comprehensive cold-start framework that integrates visual-anchored data synthesis, attention-guided objectives, and visual-anchored reward shaping.
arXiv Detail & Related papers (2026-03-04T08:22:27Z)
- $\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs [26.779915891040236]
We propose VisiPruner, a training-free pruning framework that reduces up to 99% of vision-related attention computations and 53.9% of FLOPs on LLaVA-v1.5 7B. Our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics.
arXiv Detail & Related papers (2025-10-20T06:40:17Z)
- Spotlight on Token Perception for Multimodal Reinforcement Learning [65.97597482517425]
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs). In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception. We propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal.
arXiv Detail & Related papers (2025-10-10T11:25:33Z)
- AttAnchor: Guiding Cross-Modal Token Alignment in VLMs with Attention Anchors [3.9039205692819547]
We propose Attention Anchor, a parameter-free framework that efficiently groups semantically similar tokens across modalities. By inserting text tokens near relevant visual patches, we create semantic signposts that reveal true content-based cross-modal attention scores. AttAnchor achieves improvements across 13 out of 15 different metrics and benchmarks, including up to 32% gains on reasoning tasks and up to 15% improvements on other benchmarks.
arXiv Detail & Related papers (2025-09-27T04:37:26Z)
- ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model [82.93634081255942]
We propose a vision-language connector that enables MLLMs to achieve high accuracy while maintaining low cost.
We first reveal the existence of the visual anchors in Vision Transformer and propose a cost-effective search algorithm to extract them.
We introduce the Anchor Former (AcFormer), a novel vision-language connector designed to leverage the rich prior knowledge obtained from these visual anchors during pretraining.
arXiv Detail & Related papers (2024-05-28T04:23:00Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)