HAEPO: History-Aggregated Exploratory Policy Optimization
- URL: http://arxiv.org/abs/2508.18884v1
- Date: Tue, 26 Aug 2025 09:59:44 GMT
- Title: HAEPO: History-Aggregated Exploratory Policy Optimization
- Authors: Gaurish Trivedi, Alakh Sharma, Kartikey Singh Bhandari, Dhruv Kumar, Pratik Narang, Jagat Sesh Challa
- Abstract summary: We introduce History-Aggregated Exploratory Policy Optimization (HAEPO), a history-aware exploratory loss to combat these shortcomings. HAEPO compresses each trajectory into the sum of its logarithmic probabilities and applies a Plackett-Luce softmax across trajectories. Empirically, HAEPO converges fast, explores thoroughly, aligns closely with true rewards, and demonstrates robust learning behavior better than or on par with PPO, GRPO, and DPO.
- Score: 4.782714372521615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploration is essential in modern learning, from reinforcement learning environments with small neural policies to large language models (LLMs). Existing work, such as DPO, leverages full-sequence log-likelihoods to capture an entire trajectory of the model's decisions, while methods like GRPO aggregate per-token ratios into a trajectory-level update. However, both often limit exploration on long-horizon tasks. We introduce History-Aggregated Exploratory Policy Optimization (HAEPO), a history-aware exploratory loss that combats these shortcomings. HAEPO compresses each trajectory into the sum of its logarithmic probabilities (a cumulative logarithmic likelihood) and applies a Plackett-Luce softmax across trajectories to obtain normalized weights proportional to their returns, thus encouraging broader exploration. We add entropy regularization to stabilize aggressive updates and prevent premature collapse, together with a soft KL penalty relative to a frozen copy of the previous (reference) policy. Empirically, HAEPO converges fast, explores thoroughly, aligns closely with true rewards, and demonstrates robust learning behavior better than or on par with PPO, GRPO, and DPO across diverse tasks. Thus, HAEPO provides a stable and interpretable framework by explicitly leveraging full-trajectory history while balancing exploration and stability.
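The trajectory-level update described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes the Plackett-Luce weights are a softmax over raw returns, uses the entropy of those weights as the regularizer, and approximates the soft KL penalty by the mean gap in cumulative log-likelihood against the frozen reference policy; the coefficients `lam_ent` and `lam_kl` are hypothetical.

```python
import numpy as np

def haepo_loss(logp_tokens, returns, logp_ref_tokens, lam_ent=0.01, lam_kl=0.1):
    """Sketch of a HAEPO-style trajectory-level loss.

    logp_tokens     : list of 1-D arrays, current-policy token log-probs per trajectory
    returns         : 1-D array of scalar returns, one per trajectory
    logp_ref_tokens : same shapes as logp_tokens, from the frozen reference policy
    """
    # 1. Compress each trajectory into its cumulative log-likelihood.
    cum_logp = np.array([lp.sum() for lp in logp_tokens])
    cum_logp_ref = np.array([lp.sum() for lp in logp_ref_tokens])

    # 2. Softmax over trajectory returns: normalized weights proportional
    #    to returns (the Plackett-Luce-style aggregation, simplified).
    z = returns - returns.max()                 # shift for numerical stability
    w = np.exp(z) / np.exp(z).sum()

    # 3. Policy term: push probability mass toward high-return trajectories.
    policy_term = -(w * cum_logp).sum()

    # 4. Entropy of the trajectory weights, as a proxy regularizer
    #    discouraging premature collapse onto one trajectory.
    entropy = -(w * np.log(w + 1e-12)).sum()

    # 5. Soft KL penalty against the frozen reference policy, approximated
    #    by the mean cumulative log-likelihood gap.
    kl = np.mean(cum_logp - cum_logp_ref)

    return policy_term - lam_ent * entropy + lam_kl * kl
```

Raising the log-likelihood of a high-return trajectory lowers this loss, while the entropy and KL terms temper how aggressively the weights concentrate.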
Related papers
- Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning [88.42566960813438]
CalibRL is a hybrid-policy RLVR framework that supports controllable exploration with expert guidance. CalibRL increases policy entropy in a guided manner and clarifies the target distribution. Experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements.
arXiv Detail & Related papers (2026-02-22T07:23:36Z) - RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization [40.41228010377401]
We propose Rephrasing Policy Optimization (RePO) to reconcile off-policy knowledge with the stability of on-policy RL. RePO rephrases off-policy knowledge into trajectories that conform to its own stylistic and parametric distribution. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines.
arXiv Detail & Related papers (2026-02-11T13:02:40Z) - DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training [94.568675548967]
Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain generalization. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. We propose DFPO, a robust distributional RL framework that models values as continuous flows across time steps.
arXiv Detail & Related papers (2026-02-05T17:07:42Z) - Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for reinforcement learning with Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint. DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z) - SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models [67.41779761651924]
SOUP is a framework that unifies off- and on-policy learning within individual samples at the token level. It consistently outperforms standard on-policy training and existing off-policy extensions.
arXiv Detail & Related papers (2026-01-29T09:56:15Z) - Distribution-Centric Policy Optimization Dominates Exploration-Exploitation Trade-off [34.80019950191864]
We introduce a distribution-centric perspective for reinforcement learning. We propose Distribution-Centric Policy Optimization (DCPO), which reformulates entropy regulation as distribution-level regularization. Overall, DCPO replaces sample-level mechanisms with distribution-level principles, offering a theoretically grounded and flexible framework for exploration and a stronger exploration-exploitation (EE) trade-off.
arXiv Detail & Related papers (2026-01-19T05:20:46Z) - Coverage Improvement and Fast Convergence of On-policy Preference Learning [67.36750525893514]
Online on-policy preference learning algorithms for language model alignment can significantly outperform their offline counterparts. We analyze how the sampling policy's coverage evolves throughout on-policy training. We develop principled on-policy schemes for reward distillation in the general function class setting.
arXiv Detail & Related papers (2026-01-13T10:46:06Z) - IIB-LPO: Latent Policy Optimization via Iterative Information Bottleneck [20.113524065146674]
Iterative Information Bottleneck (IIB-LPO) is a novel approach that shifts exploration from statistical perturbation of tokens to topological branching of reasoning trajectories. IIB-LPO achieves state-of-the-art performance, surpassing prior methods by margins of up to 5.3% in accuracy and 7.4% in diversity metrics.
arXiv Detail & Related papers (2026-01-09T15:46:40Z) - BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping [69.74252624161652]
We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO). BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. On the AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
arXiv Detail & Related papers (2025-10-21T12:55:04Z) - Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions [0.5416466085090772]
We introduce Quantile Reward Policy Optimization (QRPO), which learns from pointwise absolute rewards. QRPO uses quantile rewards to enable regression to the closed-form solution of the KL-regularized RL objective. It consistently achieves top performance on chat and coding evaluations.
arXiv Detail & Related papers (2025-07-10T17:56:24Z) - GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning [34.25769740497309]
GenPO is a generative policy optimization framework that leverages exact diffusion inversion to construct invertible action mappings. GenPO is the first method to successfully integrate diffusion policies into on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment.
arXiv Detail & Related papers (2025-05-24T15:57:07Z) - Entropy Controllable Direct Preference Optimization [3.536605202672355]
We propose H-DPO, a simple modification to DPO that allows for control over the entropy of the resulting policy. In our experiments, H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@$k$ evaluations for mathematical tasks.
arXiv Detail & Related papers (2024-11-12T07:09:44Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation, with performance stronger than or similar to PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function [50.812404038684505]
We show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation.
We discuss applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.
arXiv Detail & Related papers (2024-04-18T17:37:02Z) - APS: Active Pretraining with Successor Features [96.24533716878055]
We show that by reinterpreting and combining successor features (Hansen et al.) with nonparametric entropy maximization, the intractable mutual information can be efficiently optimized.
The proposed method, Active Pretraining with Successor Features (APS), explores the environment via nonparametric entropy maximization, and the explored data can be efficiently leveraged to learn behavior.
arXiv Detail & Related papers (2021-08-31T16:30:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.