Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning
- URL: http://arxiv.org/abs/2512.15687v1
- Date: Wed, 17 Dec 2025 18:44:45 GMT
- Title: Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning
- Authors: Zhenwen Liang, Sidi Lu, Wenhao Yu, Kishan Panaganti, Yujun Zhou, Haitao Mi, Dong Yu
- Abstract summary: Trajectories that introduce novel gradient directions receive a bounded multiplicative reward scalar. G2RL consistently improves pass@1, maj@16, and pass@k over entropy-based GRPO and external embedding methods.
- Score: 44.07085022671951
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning has become essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface-level variation but offer no guarantee that sampled trajectories differ in the update directions that shape optimization. We propose G2RL, a gradient-guided reinforcement learning framework in which exploration is driven not by external heuristics but by the model's own first-order update geometry. For each response, G2RL constructs a sequence-level feature from the model's final-layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within a sampled group. Trajectories that introduce novel gradient directions receive a bounded multiplicative reward scalar, while redundant or off-manifold updates are deemphasized, yielding a self-referential exploration signal that is naturally aligned with PPO-style stability and KL control. Across math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLU-Pro) on Qwen3 base 1.7B and 4B models, G2RL consistently improves pass@1, maj@16, and pass@k over entropy-based GRPO and external embedding methods. Analyzing the induced geometry, we find that G2RL expands exploration into substantially more orthogonal and often opposing gradient directions while maintaining semantic coherence, revealing that a policy's own update space provides a far more faithful and effective basis for guiding exploration in large language model reinforcement learning.
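The mechanism described in the abstract has a simple shape: extract one feature vector per sampled response, compare the vectors within the group, and rescale each reward by a bounded multiplicative factor that favors novel gradient directions. The sketch below is a minimal PyTorch illustration of that idea, not the paper's implementation: the function name `g2rl_reward_scaling`, the placeholder features, the cosine-similarity novelty score, and the `[1 - eps, 1 + eps]` scaling range are all assumptions; the paper derives its features from the model's final-layer sensitivity, and the abstract does not specify the exact novelty-to-scale mapping.

```python
import torch
import torch.nn.functional as F

def g2rl_reward_scaling(features: torch.Tensor,
                        rewards: torch.Tensor,
                        eps: float = 0.2) -> torch.Tensor:
    """Rescale a group of trajectory rewards by gradient-direction novelty.

    features: (G, d) sequence-level feature vectors, one per sampled response
              (stand-ins for the paper's final-layer sensitivity features).
    rewards:  (G,) scalar rewards for the same group of responses.
    eps:      half-width of the bounded multiplicative scaling range (assumed).
    """
    G = features.size(0)
    f = F.normalize(features, dim=-1)          # unit-norm feature directions
    sim = f @ f.t()                            # (G, G) pairwise cosine similarity
    # Mean similarity of each trajectory to the *other* group members;
    # the diagonal (self-similarity of 1) is subtracted out.
    redundancy = (sim.sum(dim=1) - 1.0) / max(G - 1, 1)
    novelty = 1.0 - redundancy                 # large for novel/opposing directions
    # Min-max normalize novelty within the group, then map into the
    # bounded multiplicative range [1 - eps, 1 + eps].
    n = (novelty - novelty.min()) / (novelty.max() - novelty.min() + 1e-8)
    scale = (1.0 - eps) + 2.0 * eps * n
    return rewards * scale

# Toy usage: 8 responses with 16-dim placeholder features.
feats = torch.randn(8, 16)
rewards = torch.ones(8)                        # e.g. binary correctness rewards
print(g2rl_reward_scaling(feats, rewards))     # values in [0.8, 1.2]
```

Because the scaler is bounded and multiplicative, it perturbs the group's rewards only mildly, which is consistent with the abstract's claim that the signal stays compatible with PPO-style clipping and KL control.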
Related papers
- Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning [56.29188272643489]
We propose GOLF, an RL framework that exploits group-level language feedback to guide targeted exploration. GOLF aggregates external critiques that pinpoint errors or propose targeted fixes, and intra-group attempts that supply alternative partial ideas and diverse failure patterns. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency.
arXiv Detail & Related papers (2026-03-04T20:53:17Z) - Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning [88.42566960813438]
CalibRL is a hybrid-policy RLVR framework that supports controllable exploration with expert guidance. CalibRL increases policy entropy in a guided manner and clarifies the target distribution. Experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements.
arXiv Detail & Related papers (2026-02-22T07:23:36Z) - Beyond Alignment: Expanding Reasoning Capacity via Manifold-Reshaping Policy Optimization [1.974921946982281]
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated remarkable success in enhancing the reasoning capabilities of Large Language Models (LLMs). Recent studies question whether RL genuinely expands reasoning capacity or merely aligns existing latent capabilities, arguing that exploration remains confined within the pre-trained model's low-rank bias manifold. We propose Manifold-Reshaping Policy Optimization (MRPO), a geometric framework designed to fundamentally restructure the inference space of LLMs.
arXiv Detail & Related papers (2026-01-30T05:38:44Z) - Generative Actor Critic [74.04971271003869]
Generative Actor Critic (GAC) is a novel framework that decouples sequential decision-making by reframing policy evaluation as learning a generative model of the joint distribution over trajectories and returns. Experiments on Gym-MuJoCo and Maze2D benchmarks demonstrate GAC's strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-12-25T06:31:11Z) - GrndCtrl: Grounding World Models via Self-Supervised Reward Alignment [16.343768407636322]
We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with a physically verifiable structure through geometric and perceptual rewards. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation.
arXiv Detail & Related papers (2025-12-01T18:03:29Z) - Cognitive Maps in Language Models: A Mechanistic Analysis of Spatial Planning [2.1115884707107715]
We train GPT-2 models on three spatial learning paradigms in grid environments. Using behavioural, representational and mechanistic analyses, we uncover two fundamentally different learned algorithms.
arXiv Detail & Related papers (2025-11-17T13:46:19Z) - Off-policy Reinforcement Learning with Model-based Exploration Augmentation [29.61835214523957]
We propose Modelic Generative Exploration (MoGE), which augments exploration through the generation of under-explored critical states. MoGE is composed of two components: (1) a diffusion-based generator that synthesizes critical states under the guidance of a utility function evaluating each state's potential influence on policy exploration, and (2) a one-step imagination world model for constructing critical transitions based on the critical states for agent learning. Our method adopts a modular formulation that aligns with the principles of off-policy learning, allowing seamless integration with existing algorithms to improve exploration without altering their core structures.
arXiv Detail & Related papers (2025-10-29T13:53:52Z) - Inpainting-Guided Policy Optimization for Diffusion Large Language Models [67.97530437998117]
Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs. We explore how inpainting can inform RL algorithm design for dLLMs.
arXiv Detail & Related papers (2025-09-12T16:44:31Z) - RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning [125.96848846966087]
Training large language models (LLMs) as interactive agents presents unique challenges. While reinforcement learning has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO, a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents.
arXiv Detail & Related papers (2025-04-24T17:57:08Z) - Subequivariant Graph Reinforcement Learning in 3D Environments [34.875774768800966]
We propose a novel setup for morphology-agnostic RL, dubbed Subequivariant Graph RL in 3D environments.
Specifically, we first introduce a new set of more practical yet challenging benchmarks in 3D space.
To optimize the policy over the enlarged state-action space, we propose to inject geometric symmetry.
arXiv Detail & Related papers (2023-05-30T11:34:57Z) - Model-Free Generative Replay for Lifelong Reinforcement Learning: Application to Starcraft-2 [5.239932780277599]
Generative replay (GR) is a biologically-inspired replay mechanism that augments learning experiences with self-labelled examples.
We present a version of GR for LRL that satisfies two desiderata: (a) introspective density modelling of the latent representations of policies learned using deep RL, and (b) model-free end-to-end learning.
arXiv Detail & Related papers (2022-08-09T22:00:28Z) - GEM: Group Enhanced Model for Learning Dynamical Control Systems [78.56159072162103]
We build effective dynamical models that are amenable to sample-based learning.
We show that learning the dynamics on a Lie algebra vector space is more effective than learning a direct state transition model.
This work sheds light on a connection between learning of dynamics and Lie group properties, which opens doors for new research directions.
arXiv Detail & Related papers (2021-04-07T01:08:18Z) - Nested-Wasserstein Self-Imitation Learning for Sequence Generation [158.19606942252284]
We propose the concept of nested-Wasserstein distance for distributional semantic matching.
A novel nested-Wasserstein self-imitation learning framework is developed, encouraging the model to exploit historical high-rewarded sequences.
arXiv Detail & Related papers (2020-01-20T02:19:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.