Boosting Reasoning in Large Multimodal Models via Activation Replay
- URL: http://arxiv.org/abs/2511.19972v2
- Date: Thu, 27 Nov 2025 10:11:45 GMT
- Title: Boosting Reasoning in Large Multimodal Models via Activation Replay
- Authors: Yun Xing, Xiaobin Hu, Qingdong He, Jiangning Zhang, Shuicheng Yan, Shijian Lu, Yu-Gang Jiang
- Abstract summary: We show that RLVR unexpectedly shifts low-entropy activations, while high-entropy ones are less affected. We propose Activation Replay, a training-free approach that boosts the multimodal reasoning of post-trained LMMs.
- Score: 136.6522463570943
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), yet the mechanisms underlying this post-training paradigm remain poorly understood. We begin by exploring how input activations are affected by RLVR through the perspective of the logit lens. Our systematic investigation across multiple post-trained LMMs suggests that RLVR unexpectedly shifts low-entropy activations, while high-entropy ones are less affected. Through controlled experiments, we further demonstrate that this phenomenon is associated with LMM reasoning, suggesting a potentially beneficial role for modulating low-entropy activations. To this end, we propose Activation Replay, a simple yet effective training-free approach that boosts the multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design manipulates visual tokens at test time, replaying low-entropy activations from the input context of the base LMM to regulate those of its RLVR counterpart. Activation Replay improves reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates the narrowed reasoning coverage induced by RLVR. We compare our design against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or intervening directly across models instead of manipulating input tokens, demonstrating the superiority of our implementation. Code will be made publicly available.
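The abstract names the logit-lens entropy criterion but not the exact token-level replay rule, so the sketch below should be read as one plausible rendering rather than the paper's implementation: it scores positions by logit-lens entropy, picks the lowest-entropy visual tokens, and swaps in the base model's input embeddings there. The Hugging Face-style decoder layout, `visual_mask`, `k`, and the choice of scoring layer are all assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def logit_lens_entropy(hidden, unembed):
    """Entropy of each token's logit-lens next-token distribution.

    hidden:  (seq, d) residual-stream activations at a chosen layer
    unembed: (vocab, d) output embedding matrix, e.g. lm_head.weight
    """
    logp = F.log_softmax(hidden @ unembed.T, dim=-1)   # (seq, vocab)
    return -(logp.exp() * logp).sum(dim=-1)            # (seq,)

@torch.no_grad()
def replay_low_entropy_visual_tokens(base_embeds, rlvr_embeds,
                                     base_hidden, unembed, visual_mask, k):
    """Hypothetical Activation Replay step: score positions by
    logit-lens entropy on the base model's activations, pick the k
    lowest-entropy *visual* positions, and replay the base model's
    input embeddings there, leaving all other tokens untouched."""
    ent = logit_lens_entropy(base_hidden, unembed)
    ent = ent.masked_fill(~visual_mask, float("inf"))  # visual tokens only
    idx = ent.topk(k, largest=False).indices           # lowest entropy
    out = rlvr_embeds.clone()
    out[idx] = base_embeds[idx]                        # replay from base LMM
    return out  # feed to the RLVR model via inputs_embeds
```

Operating on input embeddings keeps the intervention at the input-token level, consistent with the abstract's contrast between manipulating input tokens and direct cross-model activation intervention.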
Related papers
- Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs [16.74831908818562]
Recent evidence shows that models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We identify a "Perplexity Paradox": spurious RLVR triggers a divergence where answer-token perplexity drops while prompt-side coherence degrades. We uncover a hidden Anchor-Adapter circuit that facilitates this shortcut. (A small sketch of the split perplexity measurement follows this entry.)
arXiv Detail & Related papers (2026-01-16T07:55:38Z)
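To make the "Perplexity Paradox" concrete, here is a minimal sketch of how one might measure prompt-side versus answer-token perplexity with a Hugging Face causal LM; `answer_start` (the index of the first answer token) is an assumed input, and the paper's actual evaluation protocol may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def split_perplexity(model, input_ids, answer_start):
    """Perplexity of a causal LM split into prompt and answer segments.

    input_ids:    (1, seq) token ids for the prompt followed by the answer
    answer_start: index of the first answer token in the sequence
    Returns (prompt_ppl, answer_ppl).
    """
    logits = model(input_ids).logits                 # (1, seq, vocab)
    logp = F.log_softmax(logits[:, :-1], dim=-1)     # predicts token t+1
    nll = -logp.gather(-1, input_ids[:, 1:, None]).squeeze(-1)  # (1, seq-1)
    prompt_nll = nll[:, : answer_start - 1]          # losses on prompt tokens
    answer_nll = nll[:, answer_start - 1 :]          # losses on answer tokens
    return prompt_nll.mean().exp().item(), answer_nll.mean().exp().item()
```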
- Spotlight on Token Perception for Multimodal Reinforcement Learning [65.97597482517425]
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs). In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception. We propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal.
arXiv Detail & Related papers (2025-10-10T11:25:33Z)
- Meaningless Tokens, Meaningful Gains: How Activation Shifts Enhance LLM Reasoning [53.35553353785948]
Motivated by the puzzling observation that inserting long sequences of meaningless tokens before the query prompt can consistently enhance LLM reasoning performance, this work analyzes the underlying mechanism driving the phenomenon. We find that the improvements arise from a redistribution of activations in the LLM's layers, where near-zero activations become less frequent while large-magnitude activations increase. We propose a lightweight inference-time technique that modifies activations directly without altering the input sequence. (A hook-based sketch of such an activation shift follows this entry.)
arXiv Detail & Related papers (2025-10-01T15:39:38Z)
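The paper modifies activations at inference time; since the exact rule is not given in the summary, the following PyTorch hook is only an assumed illustration: it pushes near-zero activations outward and mildly amplifies the rest, with `eps`, `gain`, and the hooked layer chosen arbitrarily.

```python
import torch

def make_activation_shift_hook(eps=0.05, gain=1.1):
    """Inference-time activation shift (illustrative rule, not the
    paper's): make near-zero activations less frequent by pushing them
    out to magnitude eps, and mildly amplify the remaining ones."""
    def hook(module, inputs, output):
        small = output.abs() < eps
        pushed = torch.sign(output) * eps   # exact zeros stay at zero
        return torch.where(small, pushed, output * gain)
    return hook

# Hypothetical usage on one MLP of a Hugging Face decoder:
# handle = model.model.layers[20].mlp.register_forward_hook(
#     make_activation_shift_hook())
# ... generate ...
# handle.remove()
```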
- Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs [35.27561531876348]
This paper systematically investigates the impact of Reinforcement Learning with Verifiable Rewards (RLVR) on Large Language Models (LLMs). We show that RLVR can extend the reasoning boundary for both mathematical and coding tasks. We present a theoretical framework explaining RLVR's incentive mechanism, demonstrating how it can encourage correct reasoning even when rewards are based solely on answer correctness.
arXiv Detail & Related papers (2025-06-17T07:06:56Z)
- Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought [58.321044666612174]
Vad-R1 is an end-to-end MLLM-based framework for Video Anomaly Reasoning. We design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies. We also propose an improved reinforcement learning algorithm, AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs.
arXiv Detail & Related papers (2025-05-26T12:05:16Z)
- Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models [45.938663388013445]
We show that a small set of high-impact activations in the last few layers governs long-form reasoning attributes. By simply amplifying these activations and inserting "wait" tokens, we can invoke the long CoT ability without any training. (A sketch of this amplify-and-insert recipe follows this entry.)
arXiv Detail & Related papers (2025-05-23T10:07:18Z)
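A minimal sketch of the amplify-and-insert idea, assuming a Hugging Face-style decoder; the dimension indices, scale `alpha`, number of layers, and the way the "wait" cue is injected are all hypothetical stand-ins for whatever the paper actually selects.

```python
import torch

def make_amplify_hook(dims, alpha=2.0):
    """Scale a chosen set of hidden dimensions in a decoder layer's
    output hidden states; the paper amplifies a small set of
    high-impact activations in the last few layers."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[..., dims] *= alpha   # in-place scaling of selected dims
        return output
    return hook

# Hypothetical usage: amplify three arbitrary dimensions in the last
# three layers, and append a "wait" cue to nudge longer reasoning.
# for layer in model.model.layers[-3:]:
#     layer.register_forward_hook(make_amplify_hook(dims=[11, 542, 907]))
# prompt = question + "\nwait"
```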
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? [66.61292196146016]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). This study critically examines the current state of RLVR. We find that the current training setup does not elicit fundamentally new reasoning patterns. (Reasoning coverage of this kind is commonly measured with the unbiased Pass@K estimator; a sketch follows this entry.)
arXiv Detail & Related papers (2025-04-18T17:59:56Z)
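Both this entry's notion of a reasoning boundary and the main paper's Pass@K results rest on the standard unbiased pass@k estimator of Chen et al. (2021); the following sketch is that textbook formula, not code from either paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generations is correct, given
    that c of the n generations are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 256 samples with 40 correct gives pass@8
# print(pass_at_k(256, 40, 8))  # ≈ 0.748
```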
- GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we term thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z)