Fugu-MT Paper Translation (Summary): Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

Paper summary: Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

arxiv url: http://arxiv.org/abs/2605.11775v2
Date: Thu, 14 May 2026 14:02:12 GMT
Status: Translation complete
System update date: 2026-05-15 15:19:49.892463
Title: Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
Authors: Jiazheng Zhang, Ziche Fu, Junrui Shen, Yunbin Zhao, Yunke Zhang, Zhiheng Xi, Long Ma, Chenxin An, Zhihao Zhang, Shichun Liu, Dingwei Zhu, Shihan Dou, Shaofan Liu, Han Li, Wiggin Zhou, Aiden Adams, Tao Gui, Fei Huang, Qi Zhang, Xuanjing Huang
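Abstract summary: Empirically, the authors show that entropy polarity reliably predicts entropy changes. The paper proposes Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting.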
Reference score (independently computed attention score): 77.8471519867791
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.
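The first-order view of entropy change described in the abstract can be illustrated on a toy softmax policy. The sketch below is not the paper's definition of entropy polarity. It assumes a single categorical distribution parameterized directly by logits, a plain policy-gradient step on one sampled token, and hypothetical names (`first_order_entropy_change`, `actual_entropy_change`, `lr`) chosen for illustration. It uses the standard identity dH/dz_k = -p_k (log p_k + H) for softmax entropy and compares the linear prediction with the actual change.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; assumes all probabilities are positive.
    return float(-np.sum(p * np.log(p)))

def logit_step(z, token, advantage, lr):
    # Policy-gradient step on the logits for one sampled token:
    # grad of advantage * log p[token] w.r.t. z is advantage * (onehot - p).
    p = softmax(z)
    onehot = np.eye(len(z))[token]
    return lr * advantage * (onehot - p)

def first_order_entropy_change(z, token, advantage, lr):
    # Linearized entropy change: <dH/dz, delta_z>,
    # with dH/dz_k = -p_k * (log p_k + H).
    p = softmax(z)
    grad_h = -p * (np.log(p) + entropy(p))
    return float(grad_h @ logit_step(z, token, advantage, lr))

def actual_entropy_change(z, token, advantage, lr):
    # Exact entropy difference after applying the same step.
    dz = logit_step(z, token, advantage, lr)
    return entropy(softmax(z + dz)) - entropy(softmax(z))

if __name__ == "__main__":
    logits = np.array([2.0, 0.5, 0.0, -1.0])  # toy 4-token vocabulary
    for tok in range(len(logits)):
        pred = first_order_entropy_change(logits, tok, advantage=1.0, lr=1e-3)
        real = actual_entropy_change(logits, tok, advantage=1.0, lr=1e-3)
        print(f"token={tok} predicted={pred:+.3e} actual={real:+.3e}")
```

In this toy setting, reinforcing the most probable token yields a negative predicted change (contraction), while reinforcing a low-probability token yields a positive one (expansion), which is consistent with the asymmetry the abstract describes. Flipping the sign of the advantage flips the sign of the prediction.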


Related papers