Fugu-MT Paper Translation (Abstract): RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

Paper overview: RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

arxiv url: http://arxiv.org/abs/2511.04285v1
Date: Thu, 06 Nov 2025 11:27:16 GMT
Status: Translation complete
System last updated: 2025-11-07 20:17:53.404107
Title: RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization
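Title (reference translation): RLoop: A Self-Improving Framework for Reinforcement Learning via Iterative Policy Initialization
Authors: Zeng Zhiyuan, Jiashuo Liu, Zhangyue Yin, Ge Zhang, Wenhao Huang, Xipeng Qiu
Abstract summary: We introduce RLoop, a self-improving framework for training large reasoning models. RLoop first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. Experiments show that RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
Reference score (independently computed attention score): 65.23034604711489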
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for training large reasoning models, its training dynamics harbor a critical challenge: RL overfitting, where models gain training rewards but lose generalization. Our analysis reveals this is driven by policy over-specialization and catastrophic forgetting of diverse solutions generated during training. Standard optimization discards this valuable inter-step policy diversity. To address this, we introduce RLoop, a self-improving framework built on iterative policy initialization. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. This dataset is used via Rejection-sampling Fine-Tuning (RFT) to refine the initial policy, creating a superior starting point for the next iteration. This loop of exploration and exploitation via iterative re-initialization effectively converts transient policy variations into robust performance gains. Our experiments show RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
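Abstract (reference translation): Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for training large reasoning models, but its training dynamics pose a critical challenge: RL overfitting, in which models gain training reward but lose generalization. Our analysis shows that this is caused by policy over-specialization and catastrophic forgetting of the diverse solutions generated during training. Standard optimization discards this valuable inter-step policy diversity. To address this, we introduce RLoop, a self-improving framework based on iterative policy initialization. RLoop first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. This dataset is used via Rejection-sampling Fine-Tuning (RFT) to refine the initial policy, yielding a better starting point for the next iteration. This loop of exploration and exploitation through iterative re-initialization effectively converts transient policy variations into robust performance gains. Experiments show that RLoop improves average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.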
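
The loop described in the abstract (explore with RL, keep only the successful trajectories, then refine the iteration's starting policy with RFT) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `rloop`, `toy_rl_explore`, `toy_is_correct`, and `toy_rft_update`, the dictionary-based "policy", and the `ANSWERS` table are all hypothetical stand-ins chosen so the example runs without any model or RL library.

```python
from typing import Callable, Dict, List, Tuple

Trajectory = Tuple[str, str]  # (prompt, response); hypothetical representation
Policy = Dict[str, str]       # toy stand-in for a model: prompt -> memorized response


def rloop(
    init_policy: Policy,
    rl_explore: Callable[[Policy, List[str]], List[Trajectory]],
    is_correct: Callable[[Trajectory], bool],
    rft_update: Callable[[Policy, List[Trajectory]], Policy],
    prompts: List[str],
    num_iterations: int,
) -> Policy:
    """Sketch of the explore / filter / re-initialize cycle from the abstract."""
    policy = init_policy
    for _ in range(num_iterations):
        # Exploration: run RL from the current starting policy and collect
        # the trajectories sampled along the way.
        trajectories = rl_explore(policy, prompts)
        # Filtering: keep only trajectories that a verifier marks as successful.
        expert = [t for t in trajectories if is_correct(t)]
        # Exploitation: refine the iteration's *starting* policy on the expert
        # data (RFT); the result initializes the next iteration.
        if expert:
            policy = rft_update(policy, expert)
    return policy


# --- Toy stand-ins (assumptions for illustration only) ---
ANSWERS = {"1+1": "2", "2+2": "4"}


def toy_rl_explore(policy: Policy, prompts: List[str]) -> List[Trajectory]:
    # Pretend exploration yields the policy's current guess plus one
    # alternative response per prompt.
    out: List[Trajectory] = []
    for p in prompts:
        out.append((p, policy.get(p, "?")))
        out.append((p, ANSWERS[p]))
    return out


def toy_is_correct(traj: Trajectory) -> bool:
    prompt, response = traj
    return ANSWERS[prompt] == response


def toy_rft_update(policy: Policy, expert: List[Trajectory]) -> Policy:
    # "Fine-tuning" here just memorizes the verified responses in a copy.
    updated = dict(policy)
    updated.update(expert)
    return updated
```

In this sketch the policy produced by RL inside `rl_explore` is never kept directly; only its filtered trajectories feed back into the starting policy, which mirrors the abstract's description of RFT refining the initial policy for the next iteration.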
