Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling
- URL: http://arxiv.org/abs/2602.14169v1
- Date: Sun, 15 Feb 2026 14:44:15 GMT
- Title: Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling
- Authors: Yiran Guo, Zhongjian Qiao, Yingqi Xie, Jie Liu, Dan Ye, Ruiqing Zhang, Shuang Qiu, Lijie Xu,
- Abstract summary: Deep Dense Exploration (DDE) is a strategy that focuses exploration on $\textit{pivots}$: deep, recoverable states within unsuccessful trajectories. Our method consistently outperforms GRPO, tree-based methods, and other strong baselines.
- Score: 13.584783462913535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective exploration is a key challenge in reinforcement learning for large language models: discovering high-quality trajectories within a limited sampling budget from the vast natural language sequence space. Existing methods face notable limitations: GRPO samples exclusively from the root, saturating high-probability trajectories while leaving deep, error-prone states under-explored. Tree-based methods blindly disperse budgets across trivial or unrecoverable states, causing sampling dilution that fails to uncover rare correct suffixes and destabilizes local baselines. To address this, we propose Deep Dense Exploration (DDE), a strategy that focuses exploration on $\textit{pivots}$: deep, recoverable states within unsuccessful trajectories. We instantiate DDE with DEEP-GRPO, which introduces three key innovations: (1) a lightweight data-driven utility function that automatically balances recoverability and depth bias to identify pivot states; (2) local dense resampling at each pivot to increase the probability of discovering correct subsequent trajectories; and (3) a dual-stream optimization objective that decouples global policy learning from local corrective updates. Experiments on mathematical reasoning benchmarks demonstrate that our method consistently outperforms GRPO, tree-based methods, and other strong baselines.
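The abstract specifies DEEP-GRPO's three components only at a high level. As a hedged illustration, the sketch below shows one plausible shape for pivot selection and local dense resampling; the multiplicative utility, the `recover_est` estimator, and all function names are assumptions made for illustration, not the paper's actual formulation (the dual-stream objective is omitted).

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    tokens: list        # token ids of one sampled response
    reward: float       # terminal verifiable reward (0.0 = unsuccessful)

def pivot_utility(recoverability: float, depth_frac: float, beta: float = 1.0) -> float:
    """Toy utility trading off recoverability against a depth bias.

    `recoverability` would come from some estimate of still reaching a
    correct suffix from this prefix; `depth_frac` is prefix length over
    full length; `beta` controls how strongly depth is favored.
    """
    return recoverability * (depth_frac ** beta)

def select_pivot(traj: Trajectory, recover_est) -> int:
    """Return the prefix index with the highest utility in a failed trajectory."""
    best_idx, best_u = 1, float("-inf")
    for t in range(1, len(traj.tokens)):
        u = pivot_utility(recover_est(traj.tokens[:t]), t / len(traj.tokens))
        if u > best_u:
            best_idx, best_u = t, u
    return best_idx

def dense_resample(traj: Trajectory, pivot: int, policy_sample, k: int = 8) -> list:
    """Spend k local samples continuing from the pivot prefix."""
    prefix = traj.tokens[:pivot]
    return [policy_sample(prefix) for _ in range(k)]
```

The multiplicative form makes the intended trade-off explicit: a shallow but easily recoverable state and a deep but hopeless state both score low, while deep-and-recoverable states score highest.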
Related papers
- Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning [56.29188272643489]
We propose GOLF, an RL framework that exploits group-level language feedback to guide targeted exploration. GOLF aggregates external critiques that pinpoint errors or propose targeted fixes, and intra-group attempts that supply alternative partial ideas and diverse failure patterns. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency.
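As a rough illustration of the idea, the snippet below folds group-level critiques back into the next round's prompt; `critique_fn` and the prompt template are hypothetical stand-ins, since the abstract does not specify GOLF's actual aggregation scheme.

```python
def build_feedback_prompt(question: str, attempts: list, critique_fn) -> str:
    """Aggregate critiques of a group of attempts into one exploration prompt.

    `critique_fn(question, attempt) -> str` is a hypothetical external
    critic that pinpoints errors or proposes targeted fixes.
    """
    critiques = [critique_fn(question, a) for a in attempts]
    feedback = "\n".join(f"- {c}" for c in critiques)
    return (
        f"{question}\n\n"
        f"Feedback on previous group attempts:\n{feedback}\n\n"
        f"Write a new solution that avoids the errors above."
    )
```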
arXiv Detail & Related papers (2026-03-04T20:53:17Z) - Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities [10.235183326885794]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths. We propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses.
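To make the re-weighting idea concrete, here is a toy sketch that damps the advantages of already-high-likelihood correct responses so rarer correct paths keep receiving gradient; the inverse-likelihood weighting below is an assumption for illustration, not ARM's actual mechanism.

```python
import numpy as np

def reweight_advantages(advantages: np.ndarray,
                        seq_logprobs: np.ndarray,
                        correct: np.ndarray) -> np.ndarray:
    """Rescale advantages of correct samples by inverse relative likelihood."""
    adv = advantages.astype(float).copy()
    if correct.any():
        # relative likelihood of each correct sequence (max-normalized for stability)
        rel_p = np.exp(seq_logprobs[correct] - seq_logprobs[correct].max())
        w = 1.0 / rel_p
        w /= w.mean()              # preserve the average advantage scale
        adv[correct] *= w
    return adv
```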
arXiv Detail & Related papers (2026-02-05T04:06:55Z) - Beyond Alignment: Expanding Reasoning Capacity via Manifold-Reshaping Policy Optimization [1.974921946982281]
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated remarkable success in enhancing the reasoning capabilities of Large Language Models (LLMs). Recent studies question whether RL genuinely expands reasoning capacity or merely aligns existing latent capabilities, arguing that exploration remains confined within the pre-trained model's low-rank bias manifold. We propose Manifold-Reshaping Policy Optimization (MRPO), a geometric framework designed to fundamentally restructure the inference space of LLMs.
arXiv Detail & Related papers (2026-01-30T05:38:44Z) - IIB-LPO: Latent Policy Optimization via Iterative Information Bottleneck [20.113524065146674]
Iterative Information Bottleneck (IIB-LPO) is a novel approach that shifts exploration from statistical perturbation of tokens to topological branching of reasoning trajectories. IIB-LPO achieves state-of-the-art performance, surpassing prior methods by margins of up to 5.3% in accuracy and 7.4% in diversity metrics.
arXiv Detail & Related papers (2026-01-09T15:46:40Z) - Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards [48.321707628011005]
Lookahead Tree-Based Rollouts (LATR) is a novel rollout strategy designed to explicitly promote trajectory-level diversity. LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2%.
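The summary names the mechanism but not its details. As one plausible, hypothetical reading, the sketch below branches a rollout, uses a short lookahead window to drop near-duplicate branches, and completes only the diverse survivors; the actual LATR procedure is defined in the paper, not by this sketch.

```python
def diverse_rollouts(prompt: str, sample_fn, n_branches: int = 4,
                     lookahead: int = 32, keep: int = 2) -> list:
    """Branch, deduplicate by lookahead prefix, then complete diverse branches.

    `sample_fn(text, max_tokens)` is a hypothetical decoder call
    returning generated text.
    """
    branches = [sample_fn(prompt, lookahead) for _ in range(n_branches)]
    seen, kept = set(), []
    for b in branches:
        key = b[:lookahead]            # compare only the lookahead window
        if key not in seen:
            seen.add(key)
            kept.append(b)
        if len(kept) == keep:
            break
    # roll each surviving branch out to completion
    return [b + sample_fn(prompt + b, 1024) for b in kept]
```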
arXiv Detail & Related papers (2025-10-28T11:12:02Z) - VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning [62.09195763860549]
Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models (LLMs) but struggles with exploration. We introduce $\textbf{VOGUE}$ (Visual Uncertainty Guided Exploration), a novel method that shifts exploration from the output (text) to the input (visual) space. Our work shows that grounding exploration in the inherent uncertainty of visual inputs is an effective strategy for improving multimodal reasoning.
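As a hedged sketch of what "exploration guided by visual uncertainty" could look like, the function below scores disagreement between answers conditioned on a clean image versus a perturbed copy; the perturbation choice and the scoring rule are assumptions, and only the input-space framing comes from the summary above.

```python
def visual_uncertainty(image, question: str, generate_fn, perturb_fn,
                       n_samples: int = 4) -> float:
    """Fraction of perturbed-image answers not seen among clean-image answers.

    `generate_fn(image, question) -> str` and `perturb_fn(image)` (e.g. a
    blur or crop) are hypothetical hooks into a multimodal model.
    """
    clean = {generate_fn(image, question) for _ in range(n_samples)}
    noisy = {generate_fn(perturb_fn(image), question) for _ in range(n_samples)}
    return len(noisy - clean) / max(len(noisy), 1)
```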
arXiv Detail & Related papers (2025-10-01T20:32:08Z) - DDL: A Large-Scale Datasets for Deepfake Detection and Localization in Diversified Real-World Scenarios [51.916287988122406]
We present a novel large-scale deepfake detection and localization (\textbf{DDL}) dataset containing over $\textbf{1.4M+}$ forged samples. Our DDL not only provides a more challenging benchmark for complex real-world forgeries but also offers crucial support for building next-generation deepfake detection, localization, and interpretability methods.
arXiv Detail & Related papers (2025-06-29T15:29:03Z) - Scalable Online Exploration via Coverability [45.66375686120087]
Exploration is a major challenge in reinforcement learning, especially for high-dimensional domains that require function approximation.
We introduce a new objective, $L_1$-Coverage, which generalizes previous exploration schemes and supports three fundamental desiderata.
$L_1$-Coverage enables the first computationally efficient model-based and model-free algorithms for online (reward-free or reward-driven) reinforcement learning in MDPs with low coverability.
arXiv Detail & Related papers (2024-03-11T10:14:06Z) - Accelerating Inverse Learning via Intelligent Localization with Exploratory Sampling [1.5976506570992293]
Solving inverse problems is a longstanding challenge in materials and drug discovery.
Deep generative models have recently been proposed to solve inverse problems.
We propose a novel approach (called iPage) to accelerate the inverse learning process.
arXiv Detail & Related papers (2022-12-02T08:00:04Z) - On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem in the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z) - Regressive Domain Adaptation for Unsupervised Keypoint Detection [67.2950306888855]
Domain adaptation (DA) aims at transferring knowledge from a labeled source domain to an unlabeled target domain.
We present a method of regressive domain adaptation (RegDA) for unsupervised keypoint detection.
Our method yields large improvements of 8% to 11% in terms of PCK on different datasets.
arXiv Detail & Related papers (2021-03-10T16:45:22Z)