Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
- URL: http://arxiv.org/abs/2509.09265v1
- Date: Thu, 11 Sep 2025 08:50:01 GMT
- Title: Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
- Authors: Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang
- Abstract summary: Entropy-Modulated Policy Gradients (EMPG) is a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge: sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, which leads to inefficiently small updates for confident correct actions and potentially destabilizing large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines. The project page is at https://empgseed-seed.github.io/
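To make the coupling concrete: a policy-gradient update scales with how surprising the chosen action was, so confident correct steps receive vanishingly small updates while uncertain steps receive large, noisy ones. Below is a minimal PyTorch sketch of the modulation idea, assuming per-step entropy estimates and a binary trajectory outcome are available; the exponential form, the normalization, and `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def empg_step_weights(step_entropies, outcome_reward, alpha=1.0):
    """Illustrative entropy-based modulation of per-step credit.

    step_entropies: (T,) mean token entropy per agent step (low = confident).
    outcome_reward: +1 for task success, -1 for failure (sparse, trajectory-level).
    Confident steps get larger-magnitude weights (rewarded if the trajectory
    succeeded, penalized if it failed); high-entropy steps are attenuated.
    """
    # Normalize entropies to [0, 1] within the trajectory.
    h_min, h_max = step_entropies.min(), step_entropies.max()
    h = (step_entropies - h_min) / (h_max - h_min + 1e-8)
    confidence = torch.exp(-alpha * h)  # ~1 for confident steps, smaller when uncertain
    return outcome_reward * confidence

def empg_loss(step_logprobs, step_entropies, outcome_reward):
    # REINFORCE-style surrogate: scale each step's log-prob by its modulated credit.
    weights = empg_step_weights(step_entropies, outcome_reward).detach()
    return -(weights * step_logprobs).mean()
```

The paper's additional "future clarity" bonus, which encourages agents toward more predictable next steps, would add a further term and is omitted here.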
Related papers
- GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning [55.03441672267886]
We propose GradAlign, a gradient-aligned data selection method for reinforcement learning. We evaluate GradAlign across three data regimes: unreliable reward signals, distribution imbalance, and low-utility training corpus.
arXiv Detail & Related papers (2026-02-25T01:54:50Z)
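The summary above does not spell out the selection rule; one plausible reading of "gradient-aligned" selection, sketched purely as an assumption, scores each candidate sample by the cosine similarity between its gradient and a trusted reference gradient (e.g., from a small clean set) and keeps the best-aligned ones.

```python
import torch

def select_by_gradient_alignment(sample_grads, reference_grad, k):
    """Hypothetical gradient-aligned selection: keep the k samples whose
    per-sample gradients (rows of sample_grads) point most nearly in the
    direction of a trusted reference gradient."""
    ref = reference_grad / (reference_grad.norm() + 1e-8)
    grads = sample_grads / (sample_grads.norm(dim=1, keepdim=True) + 1e-8)
    scores = grads @ ref                  # cosine similarity per sample
    return torch.topk(scores, k).indices  # indices of the best-aligned samples
```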
- Evolutionary Strategies lead to Catastrophic Forgetting in LLMs [51.91763220981834]
Evolutionary Strategies (ES) have recently re-emerged as a gradient-free alternative to traditional learning algorithms. ES reaches performance numbers close to GRPO for math and reasoning tasks with a comparable compute budget, but is accompanied by significant forgetting of prior abilities, limiting its applicability for training models online.
arXiv Detail & Related papers (2026-01-28T18:59:34Z)
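For context, the textbook gradient-free ES update the summary alludes to looks roughly like the following (a generic natural-evolution-strategies step, not the paper's exact recipe):

```python
import numpy as np

def es_update(theta, reward_fn, sigma=0.02, lr=0.01, population=16, rng=None):
    """One generic evolution-strategies step: perturb the (flattened) parameters
    with Gaussian noise, score each perturbation with a possibly
    non-differentiable reward, and move along the reward-weighted noise."""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal((population, theta.size))
    rewards = np.array([reward_fn(theta + sigma * n) for n in noise])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad_estimate = advantages @ noise / (population * sigma)
    return theta + lr * grad_estimate
```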
- Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling reasoning capabilities of Large Language Models. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
arXiv Detail & Related papers (2025-10-01T12:29:32Z)
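The summary says only that unstable samples are masked out; the sketch below illustrates that general idea using importance-ratio drift as a stand-in instability signal. The threshold and the criterion itself are assumptions; CAPO's actual curvature-based test may differ.

```python
import torch

def stable_sample_mask(new_logp, old_logp, threshold=0.2):
    """Zero out samples whose policy ratio has drifted far from 1 -- a crude
    proxy for the unstable, high-curvature updates that CAPO-style methods
    identify and mask out of the policy-gradient loss."""
    ratio = torch.exp(new_logp - old_logp)
    return ((ratio - 1.0).abs() < threshold).float()

# Usage sketch: loss = -(stable_sample_mask(lp_new, lp_old) * advantages * lp_new).mean()
```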
- Agentic Reinforcement Learning with Implicit Step Rewards [92.26560379363492]
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL). We introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms. We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
arXiv Detail & Related papers (2025-09-23T16:15:42Z)
- Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control [50.316067647636196]
This paper introduces Succeed or Learn Slowly (SoLS), a novel off-policy reinforcement learning algorithm evaluated on mobile app control tasks. SoLS improves sample efficiency when fine-tuning foundation models for user interface navigation via a modified off-policy actor-critic approach. We augment SoLS with Successful Transition Replay (STR), which prioritises learning from successful interactions, as sketched after this entry.
arXiv Detail & Related papers (2025-09-01T18:55:27Z)
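Successful Transition Replay is described only as prioritizing successful interactions; one minimal way to realize that, sketched here as an assumption, is to keep separate buffers for successful and other transitions and oversample the former (capacity and mixing ratio are made up).

```python
import random
from collections import deque

class SuccessfulTransitionReplay:
    """Toy replay buffer that oversamples transitions from successful episodes."""

    def __init__(self, capacity=10_000, success_fraction=0.5):
        self.success = deque(maxlen=capacity)  # transitions from successful episodes
        self.other = deque(maxlen=capacity)    # everything else
        self.success_fraction = success_fraction

    def add(self, transition, episode_succeeded):
        (self.success if episode_succeeded else self.other).append(transition)

    def sample(self, batch_size):
        n_succ = min(int(batch_size * self.success_fraction), len(self.success))
        batch = random.sample(self.success, n_succ)
        batch += random.sample(self.other, min(batch_size - n_succ, len(self.other)))
        return batch
```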
- Beyond Freezing: Sparse Tuning Enhances Plasticity in Continual Learning with Pre-Trained Models [10.904981532789824]
Continual Learning with Pre-trained Models (PTMs) holds great promise for efficient adaptation across sequential tasks. Existing approaches freeze PTMs and rely on auxiliary modules like prompts or adapters. We propose Mutual Information-guided Sparse Tuning (MIST), a plug-and-play method that selectively updates a small subset of PTM parameters.
arXiv Detail & Related papers (2025-05-26T13:09:25Z)
- Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
Large language models (LLMs) are becoming more capable and widespread. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs.
arXiv Detail & Related papers (2025-02-03T18:59:01Z)
- VinePPO: Refining Credit Assignment in RL Training of LLMs [66.80143024475635]
We propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time.
arXiv Detail & Related papers (2024-10-02T15:49:30Z)
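The "flexibility of language environments" here is that any intermediate state is just a text prefix the policy can be re-run from, so value estimates can come from fresh rollouts rather than a learned critic. A minimal sketch under that reading; `sample_completion` and `reward_fn` are assumed callables, not VinePPO's actual interface.

```python
def mc_value_estimate(prefix, sample_completion, reward_fn, num_rollouts=8):
    """Estimate V(prefix) by sampling completions from the current policy and
    averaging their final rewards -- an unbiased Monte Carlo alternative to a
    learned value network, possible because a language 'state' can be
    re-entered simply by re-prompting from the prefix."""
    returns = [reward_fn(prefix + sample_completion(prefix)) for _ in range(num_rollouts)]
    return sum(returns) / num_rollouts
```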
- Bayesian Inverse Transition Learning for Offline Settings [30.10905852013852]
Reinforcement learning is commonly used for sequential decision-making in domains such as healthcare and education.
We propose a new constraint-based approach that captures our desiderata for reliably learning a posterior distribution of the transition dynamics $T$.
Our results demonstrate that by using our constraints, we learn a high-performing policy, while considerably reducing the policy's variance over different datasets.
arXiv Detail & Related papers (2023-08-09T17:08:29Z)
- Simplifying Deep Reinforcement Learning via Self-Supervision [51.2400839966489]
Self-Supervised Reinforcement Learning (SSRL) is a simple algorithm that optimizes policies with purely supervised losses.
We show that SSRL is surprisingly competitive to contemporary algorithms with more stable performance and less running time.
arXiv Detail & Related papers (2021-06-10T06:29:59Z)
- DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
First, we show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas in the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy to prevent deterioration of performance, as sketched after this entry.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
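A rough sketch of the propensity idea referenced in the third point above: reweight replayed transitions by how likely the current policy is to reproduce them, clipping to control variance. The log-probability inputs and clip value are assumptions; the paper's estimator for deterministic policies is more involved.

```python
import torch

def propensity_weights(new_logp, behavior_logp, clip=10.0):
    """Importance weights for replayed transitions: the current policy's
    (log-)probability of the stored action relative to the estimated behavior
    policy's, clipped so rare transitions cannot dominate the update."""
    return torch.exp(new_logp - behavior_logp).clamp(max=clip)
```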
This list is automatically generated from the titles and abstracts of the papers on this site.