Fugu-MT Paper Translation (Abstract): HAEPO: History-Aggregated Exploratory Policy Optimization

Paper overview: HAEPO: History-Aggregated Exploratory Policy Optimization

arxiv url: http://arxiv.org/abs/2508.18884v1
Date: Tue, 26 Aug 2025 09:59:44 GMT
Status: Translation complete
Last updated in system: 2025-08-27 17:42:38.792087
Title: HAEPO: History-Aggregated Exploratory Policy Optimization
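Title (reference translation): HAEPO: History-Aggregated Exploratory Policy Optimization
Authors: Gaurish Trivedi, Alakh Sharma, Kartikey Singh Bhandari, Dhruv Kumar, Pratik Narang, Jagat Sesh Challa
Abstract summary: This paper introduces History-Aggregated Exploratory Policy Optimization (HAEPO). HAEPO compresses each trajectory into the sum of its log-probabilities and applies a Plackett-Luce softmax across trajectories. Empirically, HAEPO converges fast, explores thoroughly, aligns closely with true rewards, and shows robust learning behavior on par with or better than PPO, GRPO, and DPO.
Reference score (independently computed attention score): 4.782714372521615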
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Exploration is essential in modern learning, from reinforcement learning environments with small neural policies to large language models (LLMs). Existing work, such as DPO, leverages full sequence log-likelihoods to capture an entire trajectory of the model's decisions, while methods like GRPO aggregate per-token ratios into a trajectory-level update. However, both often limit exploration on long-horizon tasks. We introduce History-Aggregated Exploratory Policy Optimization (HAEPO), a history-aware exploratory loss to combat these shortcomings. HAEPO compresses each trajectory into the sum of its logarithmic probabilities (a cumulative logarithmic likelihood), and applies a Plackett-Luce softmax across trajectories to obtain normalized weights proportional to their returns, thus encouraging broader exploration. We add entropy regularization to stabilize the aggressive updates and prevent premature collapse, together with a soft KL penalty relative to a frozen copy of the previous (reference) policy. Empirically, HAEPO converges fast, explores thoroughly, aligns closely with true rewards, and demonstrates robust learning behavior better than or on par with PPO, GRPO, and DPO across diverse tasks. Thus, HAEPO provides a stable and interpretable framework by explicitly leveraging full-trajectory history while balancing exploration and stability.
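Abstract (reference translation): Exploration is essential in modern learning, from reinforcement learning environments with small neural policies to large language models (LLMs). Existing work such as DPO uses full-sequence log-likelihoods to capture the entire trajectory of a model's decisions, while methods such as GRPO aggregate per-token ratios into a trajectory-level update. However, both often limit exploration on long-horizon tasks. To address these problems, this paper introduces History-Aggregated Exploratory Policy Optimization (HAEPO), a history-aware exploratory loss. HAEPO compresses each trajectory into the sum of its log-probabilities (a cumulative log-likelihood) and applies a Plackett-Luce softmax across trajectories to obtain normalized weights proportional to their returns. Entropy regularization is added to stabilize the aggressive updates and prevent premature collapse, along with a soft KL penalty relative to a frozen copy of the previous (reference) policy. Empirically, HAEPO converges fast, explores thoroughly, aligns closely with true rewards, and shows robust learning behavior on par with or better than PPO, GRPO, and DPO across diverse tasks. HAEPO thus provides a stable and interpretable framework that explicitly uses the full trajectory history while balancing exploration and stability.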
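
The abstract describes the loss only in words, so the following is a minimal Python sketch of one possible reading, not the authors' implementation. It assumes that the softmax is taken over the cumulative log-likelihoods of the sampled trajectories, that the entropy and KL terms are computed over the resulting trajectory weights, and that the three terms are combined linearly. The function names and the coefficients `entropy_coef` and `kl_coef` are hypothetical and do not come from the paper.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def cumulative_log_likelihood(step_log_probs):
    """Compress one trajectory into the sum of its per-step log-probabilities."""
    return sum(step_log_probs)


def haepo_style_loss(trajs, returns, ref_trajs, entropy_coef=0.01, kl_coef=0.1):
    """Scalar loss for a batch of trajectories (illustrative assumption only).

    trajs:     per-step log pi(a_t | s_t) under the current policy, one list per trajectory
    returns:   one scalar return per trajectory
    ref_trajs: the same actions scored under a frozen reference policy
    """
    cum = [cumulative_log_likelihood(t) for t in trajs]
    ref_cum = [cumulative_log_likelihood(t) for t in ref_trajs]

    # Softmax across trajectories gives normalized trajectory weights.
    log_w = log_softmax(cum)
    log_w_ref = log_softmax(ref_cum)
    w = [math.exp(lw) for lw in log_w]

    # Return-weighted objective (assumed form).
    weighted_return = sum(wi * r for wi, r in zip(w, returns))
    # Entropy of the trajectory weights (assumed placement of the entropy bonus).
    entropy = -sum(wi * lw for wi, lw in zip(w, log_w))
    # KL(current weights || reference weights) as the soft penalty (assumed form).
    kl = sum(wi * (lw - lr) for wi, lw, lr in zip(w, log_w, log_w_ref))

    loss = -weighted_return - entropy_coef * entropy + kl_coef * kl
    return loss, w


# Usage: two short trajectories with made-up log-probabilities and returns.
trajs = [[-0.1, -0.2], [-1.0, -1.5]]
returns = [1.0, 0.0]
loss, weights = haepo_style_loss(trajs, returns, ref_trajs=trajs)
```

The sketch only evaluates the scalar loss. Training a policy with it would require computing the same quantities in an automatic differentiation framework so that gradients flow through the per-step log-probabilities.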

Related papers