Fugu-MT paper translation (abstract): Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

Paper overview: Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

arxiv url: http://arxiv.org/abs/2606.17890v1
Date: Tue, 16 Jun 2026 13:10:30 GMT
Status: translation complete
Last updated in system: 2026-06-17 17:15:32.443695
Title: Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models
Title (reference translation): Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models
Authors: Zihao Wei, Wenjie Shi, Liang Pang, Jingcheng Deng, Shicheng Xu, Shasha Guo, Zenghao Duan, Jiahao Liu, Jingang Wang, Huawei Shen, Xueqi Cheng
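Abstract summary: Long-form chain-of-thought reasoning can improve performance on complex tasks. However, models often continue generating unnecessary reasoning after a correct answer has emerged. We study this phenomenon from the perspective of GRPO-style reinforcement learning.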
Reference score (independently computed attention level): 102.76983747945836
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long-form chain-of-thought reasoning can improve LLM performance on complex tasks, but models often continue generating unnecessary reasoning after a correct answer has emerged. We refer to this behavior as overthinking. We study this phenomenon from the perspective of GRPO-style reinforcement learning (RL) post-training, framing it as a training-time credit-assignment problem rather than merely a decoding-time stopping problem. In rollouts sampled at the onset of GRPO training, we observe that successful trajectories can exhibit a slightly higher degree of overthinking than unsuccessful trajectories for the same prompts. This early imbalance provides a starting point for an undesirable feedback loop: because GRPO assigns sequence-level credit, it cannot distinguish the solution-reaching prefix from the unnecessary continuation that lengthens a successful trajectory. Both receive positive update signal, allowing the initial imbalance to grow into more severe overthinking during training. To address this issue, we introduce Dynamic Rollout Editing (DRE), a training-time intervention for successful trajectories that continue thinking after answer emergence. DRE preserves the accepted verified prefix, edits the remaining thinking, and prefers the edited trajectory within the same RL group, weakening the preference signal for unnecessary thinking without penalizing the reasoning needed to reach the answer. Experiments across diverse tasks show the effectiveness of DRE.
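Abstract (reference translation): Long-form chain-of-thought reasoning can improve LLM performance on complex tasks, but models often continue generating unnecessary reasoning after a correct answer has emerged. We call this behavior overthinking. We examine this phenomenon from the perspective of GRPO-style reinforcement learning (RL) post-training and treat it as a training-time credit-assignment problem, not merely a decoding-time stopping problem. In rollouts sampled at the start of GRPO training, successful trajectories were observed to show slightly more overthinking than failed trajectories for the same prompts. This early imbalance becomes the starting point of an undesirable feedback loop: because GRPO assigns sequence-level credit, it cannot distinguish the prefix that leads to the solution from the unnecessary continuation that lengthens a successful trajectory. Both receive a positive update signal, so the initial imbalance grows into more severe overthinking during training. To address this problem, we introduce Dynamic Rollout Editing (DRE). DRE preserves the accepted prefix, edits the remaining thinking, and prefers the edited trajectory within the same RL group, weakening the preference signal for unnecessary thinking without penalizing the reasoning needed to reach the answer. Experiments across diverse tasks show the effectiveness of DRE.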
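
The credit-assignment issue described in the abstract can be illustrated with a small sketch. It assumes a commonly used GRPO-style formulation in which each trajectory's advantage is its reward standardized within the group and then applied uniformly to every token. The rewards, token counts, and the `answer_pos` index are hypothetical values chosen for illustration; the sketch does not implement DRE itself, whose details are not given in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (assumed GRPO-style form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_token_advantages(rewards, lengths):
    """Broadcast each sequence-level advantage to every token of that trajectory."""
    advantages = group_relative_advantages(rewards)
    return [[a] * n for a, n in zip(advantages, lengths)]


# Hypothetical group of four rollouts for one prompt: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 1.0, 0.0]
lengths = [12, 7, 5, 6]  # token counts per rollout (illustrative)

# Hypothetical: rollout 0 has already reached the correct answer after 5 tokens,
# so tokens 5..11 are the unnecessary continuation.
answer_pos = 5

token_advs = per_token_advantages(rewards, lengths)
prefix = token_advs[0][:answer_pos]
continuation = token_advs[0][answer_pos:]

# Both segments carry the same positive advantage, so the update cannot
# separate the solution-reaching prefix from the extra thinking.
print("prefix advantage:", prefix[0])
print("continuation advantage:", continuation[0])
```

Under these assumptions the 7 continuation tokens of rollout 0 receive exactly the same positive signal as its 5 prefix tokens, which is the sequence-level behavior the abstract identifies as the source of the feedback loop.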


Related papers