Fugu-MT Paper Translation (Abstract): Policy Optimization Prefers The Path of Least Resistance

Paper overview: Policy Optimization Prefers The Path of Least Resistance

arxiv url: http://arxiv.org/abs/2510.21853v1
Date: Wed, 22 Oct 2025 21:48:44 GMT
Status: Translation complete
System update date: 2025-10-28 15:28:14.618842
Title: Policy Optimization Prefers The Path of Least Resistance
Authors: Debdeep Sanyal, Aakash Sen Sharma, Dhruv Kumar, Saurabh Deshpande, Murari Mandal
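Abstract summary: We show that policy optimization consistently learns to discard explicit reasoning. We formalize this principle through a series of controlled reward decomposition experiments. Our findings reveal that granting policies the freedom to diverge is a double-edged sword.
Reference score (independently computed attention score): 7.4002859745101235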
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Policy optimization (PO) algorithms are used to refine Large Language Models for complex, multi-step reasoning. Current state-of-the-art pipelines enforce a strict think-then-answer format to elicit chain-of-thought (CoT); however, the behavior of PO when these rigid constraints are relaxed into an open-ended CoT structure remains an under-studied question. We investigate this gap with an extensive suite of controlled experiments and identify a consistent principle: policy optimization consistently follows the path of least resistance. When afforded the flexibility to interleave reasoning and response, policy optimization consistently learns to discard explicit reasoning, causing the policy to degenerate to a direct <answer>-only format. This outcome holds true across various models and algorithms. We find that this collapse in format is persistent even when the complex <think><answer> format is assigned up to 4x larger reward weights. We formalize this principle through a series of controlled reward decomposition experiments, demonstrating a clear hierarchy: PO systematically optimizes for the simplest reward component first, a preference that holds even when faced with mutually exclusive choices or strong incentives for more complex behaviors. Finally, we show that successful convergence on the high-reward shortcut is not a low-effort drift but is driven by the optimization process that requires the KL-regularized policy to have sufficient freedom to make a significant shift from its initial prior. Our findings reveal that granting policies the freedom to diverge is a double-edged sword: while necessary for discovering high-reward shortcuts, it also creates a powerful incentive to game the simplest aspects of the reward function, posing a critical challenge for reward hacking under alignment.
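To make the reward setup described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a format reward with two mutually exclusive components (a full <think><answer> completion versus an <answer>-only completion) and the standard KL-penalized form of a per-sample reward. The function names, regular expressions, default weights, and the `beta` coefficient are all hypothetical choices made for illustration; the abstract only states that the complex format received up to 4x larger reward weights.

```python
import re

# Hypothetical tag patterns; the abstract does not specify the exact parsing rules.
THINK_ANSWER = re.compile(
    r"^\s*<think>.+?</think>\s*<answer>.+?</answer>\s*$", re.DOTALL
)
ANSWER_ONLY = re.compile(r"^\s*<answer>.+?</answer>\s*$", re.DOTALL)


def format_reward(
    completion: str, think_weight: float = 4.0, answer_weight: float = 1.0
) -> float:
    """Score the output format with two mutually exclusive components.

    think_weight is paid for a full <think>...</think><answer>...</answer>
    completion, answer_weight for an <answer>-only completion, 0.0 otherwise.
    The 4.0 / 1.0 defaults are assumed values mirroring the "up to 4x" ratio.
    """
    if THINK_ANSWER.match(completion):
        return think_weight
    if ANSWER_ONLY.match(completion):
        return answer_weight
    return 0.0


def regularized_reward(reward: float, kl: float, beta: float = 0.1) -> float:
    """Standard KL-penalized reward: reward minus beta times the KL estimate.

    beta is an assumed coefficient; a smaller beta gives the policy more
    freedom to move away from its initial prior.
    """
    return reward - beta * kl


if __name__ == "__main__":
    full = "<think>2 + 2 is 4</think><answer>4</answer>"
    short = "<answer>4</answer>"
    print(format_reward(full), format_reward(short))
```

Under this scoring the shorter completion earns less format reward than the full one, yet the abstract reports that policies still converge on the <answer>-only format; the sketch only shows how such a weighting could be expressed, not why the collapse occurs.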


Related papers