Fugu-MT Paper Translation (Summary): ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

Paper summary: ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

arxiv url: http://arxiv.org/abs/2605.15224v1
Date: Wed, 13 May 2026 08:50:05 GMT
Status: Translation complete
Last updated in system: 2026-05-18 21:22:26.013067
Title: ICRL: Learning to Internalize Self-Critique with Reinforcement Learning
Title (reference translation): ICRL: Learning to Internalize Self-Critique with Reinforcement Learning
Authors: Jianbo Lin, Xiaomin Yu, Yi Xin, Yifu Guo, Zhuosong Jiang, Zhongqi Yue, Weishi Wang, Heqing Zou, Chengwei Qin, Hui Xiong
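Abstract summary: Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. A frozen critic cannot improve its feedback quality over time, which limits the potential for iterative self-improvement. This paper proposes learning to internalize self-critique with reinforcement learning (ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone.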
Reference score (independently computed attention score): 29.197505133648047
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement. To address this, we propose learning to internalize self-critique with reinforcement learning (ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3-4B and Qwen3-8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks, and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at https://github.com/brick-pid/ICRL.
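Abstract (reference translation): Agents based on large language models make mistakes, yet critique can often guide the same model toward correct behavior. When the critique is removed, however, the model may fail again on the same query, which indicates that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve the quality of its feedback over time, which limits the potential for iterative self-improvement. The authors therefore propose learning to internalize self-critique with reinforcement learning (ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone in order to convert critique-induced success into unassisted solver ability. The critic is rewarded according to the solver's subsequent performance gain, which incentivizes actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. In addition, role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique rather than becoming dependent on critique-conditioned behavior. ICRL is evaluated on diverse benchmarks spanning agentic and mathematical reasoning tasks, with Qwen3-4B and Qwen3-8B as backbones. The results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at https://github.com/brick-pid/ICRL.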
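
The abstract names two mechanisms without giving formulas: a critic reward based on the solver's subsequent performance gain, and role-wise group advantage estimation. The following is a minimal Python sketch of one way these could look, not the authors' implementation. It assumes the critic reward is the difference in solver success rate after versus before the critique, and that advantages are normalized GRPO-style (reward minus group mean, divided by group standard deviation) separately for each role. The function names `critic_reward`, `group_advantages`, and `role_wise_advantages` are hypothetical and do not come from the paper or its repository.

```python
from statistics import mean, pstdev


def critic_reward(success_before, success_after):
    """Assumed critic reward: solver success rate after the critique
    minus solver success rate before it (0/1 outcomes per rollout)."""
    return mean(success_after) - mean(success_before)


def group_advantages(rewards, eps=1e-6):
    """GRPO-style normalization within a single group of rollouts:
    (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def role_wise_advantages(rewards_by_role):
    """Assumed role-wise variant: each role's rewards are normalized
    against that role's own group only, so solver and critic reward
    scales do not mix."""
    return {role: group_advantages(rs) for role, rs in rewards_by_role.items()}


if __name__ == "__main__":
    # Hypothetical rollouts for one query: 0/1 solver outcomes.
    gain = critic_reward([0, 0, 1, 0], [1, 1, 1, 0])
    advantages = role_wise_advantages({
        "solver": [1.0, 0.0, 0.0, 1.0],
        "critic": [gain, 0.0, 0.25, -0.25],
    })
    print(gain)
    print(advantages)
```

Normalizing per role keeps a critic reward that lives in [-1, 1] from being compared against solver rewards in [0, 1]. The distribution-calibration re-weighting ratio is not sketched here because the abstract does not specify its form.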

Related papers