Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR
- URL: http://arxiv.org/abs/2509.23808v2
- Date: Tue, 30 Sep 2025 18:42:02 GMT
- Title: Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR
- Authors: Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, Zhi Wang
- Abstract summary: A prevailing view in Reinforcement Learning for Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off. We re-examine this perspective, proposing that this perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. We propose Velocity-Exploiting Rank-Learning (VERL), the first method to operationalize the principle of synergistic exploration-exploitation enhancement.
- Score: 15.147456927849932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A prevailing view in Reinforcement Learning for Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective, proposing that this perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, named Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that, at the hidden-state level, exploration and exploitation can be decoupled (Sec. 4), which opens an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to a 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.
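The hidden-state metrics in the abstract are concrete enough to prototype. Below is a minimal sketch (not the authors' released code) of Effective Rank as the exponentiated Shannon entropy of the normalized singular-value spectrum of a hidden-state matrix, with simple finite differences standing in for ERV and ERA; the paper's exact layer choice, normalization, and smoothing are assumptions here.

```python
import torch

def effective_rank(hidden: torch.Tensor, eps: float = 1e-12) -> float:
    """ER of a (num_tokens, hidden_dim) hidden-state matrix:
    exp of the Shannon entropy of its normalized singular values."""
    s = torch.linalg.svdvals(hidden.float())    # singular values of the matrix
    p = s / (s.sum() + eps)                     # spectrum as a probability distribution
    entropy = -(p * torch.log(p + eps)).sum()   # Shannon entropy of the spectrum
    return torch.exp(entropy).item()            # ER = exp(entropy)

def er_dynamics(er_per_step: list[float]) -> tuple[list[float], list[float]]:
    """Finite-difference proxies for ER Velocity (first difference) and
    ER Acceleration (second difference) over per-training-step ER values."""
    erv = [b - a for a, b in zip(er_per_step, er_per_step[1:])]
    era = [b - a for a, b in zip(erv, erv[1:])]
    return erv, era
```

Logging ER alongside these differences during training is how one would reproduce the decoupling analysis: ER gauges how much of the representation space is in use (exploration), while its derivatives track how quickly that usage is being consolidated (exploitation).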
Related papers
- Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward [33.74512650901766]
The paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR). Recent studies suggest that RLVR can elicit strong mathematical reasoning in Large Language Models (LLMs). Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
arXiv Detail & Related papers (2025-12-18T18:59:27Z)
- PACR: Progressively Ascending Confidence Reward for LLM Reasoning [55.06373646059141]
We propose Progressively Ascending Confidence Reward (PACR), a dense, model-intrinsic reward computed directly from the model's evolving belief in the correct answer. Our results suggest that dense, model-intrinsic shaping signals can make RLVR training more effective and reliable (see the confidence-trajectory sketch after this list).
arXiv Detail & Related papers (2025-10-25T11:25:35Z)
- Demystifying the Mechanisms Behind Emergent Exploration in Goal-conditioned RL [32.854183226427395]
We study Single-Goal Contrastive Reinforcement Learning (SGCRL), a self-supervised algorithm capable of solving challenging long-horizon goal-reaching tasks. We show that SGCRL maximizes implicit rewards shaped by its learned representations. Our improved understanding enables us to adapt SGCRL to perform safety-aware exploration.
arXiv Detail & Related papers (2025-10-15T21:55:14Z)
- Diversity-Incentivized Exploration for Versatile Reasoning [63.653348177250756]
We propose DIVER (Diversity-Incentivized Exploration for Versatile Reasoning), an innovative framework that highlights the pivotal role of global sequence-level diversity in incentivizing deep exploration for versatile reasoning (see the diversity-bonus sketch after this list).
arXiv Detail & Related papers (2025-09-30T13:11:46Z)
- CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models [85.315711639214]
We introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. For the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses (see the perplexity-bonus sketch after this list).
arXiv Detail & Related papers (2025-09-11T17:59:17Z)
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR [92.51110344832178]
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). This technical report presents a systematic investigation of exploration capacities in RLVR, covering four main aspects.
arXiv Detail & Related papers (2025-08-11T01:26:16Z)
- Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs [112.40801692473723]
Balancing exploration and exploitation is a central goal in reinforcement learning (RL). We introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Our method achieves significant gains on the Pass@K metric, even when evaluated with extremely large K values (see the entropy-shaped-advantage sketch after this list).
arXiv Detail & Related papers (2025-06-17T17:54:03Z)
- Navigate the Unknown: Enhancing LLM Reasoning with Intrinsic Motivation Guided Exploration [39.460202867967006]
We propose an Intrinsic Motivation guidEd exploratioN meThOd foR LLM Reasoning (i-MENTOR) to deliver dense rewards and amplify exploration in the RL-based paradigm. Experiments across 4 public datasets demonstrate i-MENTOR's effectiveness, achieving a 22.23% improvement on AIME 2024.
arXiv Detail & Related papers (2025-05-23T08:30:28Z)
- On the Importance of Exploration for Generalization in Reinforcement Learning [89.63074327328765]
We propose EDE: Exploration via Distributional Ensemble, a method that encourages exploration of states with high uncertainty.
Our algorithm is the first value-based approach to achieve state-of-the-art on both Procgen and Crafter (see the ensemble-uncertainty sketch after this list).
arXiv Detail & Related papers (2023-06-08T18:07:02Z)
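Sketch for PACR (referenced in its entry above): a hedged illustration of tracking the model's belief in the known-correct answer as the reasoning prefix grows, and paying a dense reward for confidence increments. The single-token answer, function names, and max(0, ·) shaping are simplifying assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def confidence_trajectory(model, prompt_ids, cot_ids, answer_id):
    """P(correct answer token | prompt + first k reasoning tokens), for each k.
    prompt_ids, cot_ids: 1-D LongTensors; assumes a HuggingFace-style
    causal LM whose output exposes .logits."""
    probs = []
    for k in range(len(cot_ids) + 1):
        ctx = torch.cat([prompt_ids, cot_ids[:k]]).unsqueeze(0)
        next_logits = model(ctx).logits[0, -1]          # next-token distribution
        probs.append(F.softmax(next_logits, dim=-1)[answer_id].item())
    return probs

def ascending_confidence_reward(probs):
    """Dense per-step reward: only positive confidence increments count."""
    return [max(0.0, b - a) for a, b in zip(probs, probs[1:])]
```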
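Sketch for DIVER (referenced above): one plausible form of a global, sequence-level diversity incentive is to score each sampled response in a rollout group by how far its embedding sits from the rest. The cosine formulation and the scale constant are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def diversity_bonus(embeddings: torch.Tensor, scale: float = 0.05) -> torch.Tensor:
    """embeddings: (group_size, dim) sequence-level embeddings of the
    responses sampled for one prompt (group_size >= 2)."""
    normed = F.normalize(embeddings, dim=-1)
    sim = normed @ normed.T                        # pairwise cosine similarity
    n = sim.shape[0]
    mean_sim = (sim.sum(dim=-1) - 1.0) / (n - 1)   # drop the self-similarity of 1
    return scale * (1.0 - mean_sim)                # more distinct responses earn more
```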
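Sketch for CDE (referenced above): the actor-side curiosity bonus, rewarding responses whose own perplexity under the policy is high. The scale and clamp values are assumptions, and the critic-side bonus (variance across value heads) is omitted here.

```python
import torch

def perplexity_bonus(token_logprobs: torch.Tensor, scale: float = 0.1) -> torch.Tensor:
    """token_logprobs: (batch, seq_len) log-probs of the sampled tokens
    under the current policy."""
    nll = -token_logprobs.mean(dim=-1)    # mean negative log-likelihood per response
    ppl = torch.exp(nll)                  # response-level perplexity
    return scale * ppl.clamp(max=10.0)    # bounded bonus added to the RL reward
```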
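Sketch for the entropy-perspective paper (referenced above): its summary describes a one-line change that augments the advantage with an entropy-based term. A hedged PyTorch rendering, with the coefficient and the detach as assumptions:

```python
import torch
from torch.distributions import Categorical

def entropy_shaped_advantage(advantages: torch.Tensor,
                             logits: torch.Tensor,
                             alpha: float = 0.01) -> torch.Tensor:
    """advantages: (batch, seq_len); logits: (batch, seq_len, vocab_size)."""
    token_entropy = Categorical(logits=logits).entropy()  # per-token policy entropy
    return advantages + alpha * token_entropy.detach()    # the 'one line' change
```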
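Sketch for EDE (referenced above): ensemble disagreement as the uncertainty signal that drives exploration, paired with an optimistic (UCB-style) action rule. The head count, bonus coefficient, and greedy rule are assumptions about the general technique, not EDE's exact algorithm.

```python
import torch

def ensemble_uncertainty(q_values: torch.Tensor) -> torch.Tensor:
    """q_values: (num_heads, batch, num_actions) from an ensemble of Q-heads."""
    return q_values.std(dim=0).mean(dim=-1)        # per-state epistemic uncertainty

def ucb_action(q_values: torch.Tensor, c: float = 0.5) -> torch.Tensor:
    """Optimistic action choice: mean Q plus a disagreement bonus."""
    mean_q = q_values.mean(dim=0)                  # (batch, num_actions)
    std_q = q_values.std(dim=0)                    # cross-head disagreement
    return (mean_q + c * std_q).argmax(dim=-1)     # (batch,) chosen actions
```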