Fugu-MT Paper Translation (Abstract): Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning

Paper summary: Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning

arxiv url: http://arxiv.org/abs/2509.22824v1
Date: Fri, 26 Sep 2025 18:30:49 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:18.894085
Title: Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning
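Title (reference translation): Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning
Authors: Chi Ruan, Dongfu Jiang, Yubo Wang, Wenhu Chen
Abstract summary: Reinforcement Learning (RL) has emerged as a popular training paradigm, particularly when paired with reasoning models. This paper proposes Critique Reinforcement Learning (CRL), in which the model is tasked with generating a critique for a given (question, solution) pair. Critique-Coder is shown to consistently outperform RL-only baselines across different benchmarks.
Reference score (independently computed attention metric): 49.35842828047236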
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning (RL) has emerged as a popular training paradigm, particularly when paired with reasoning models. While effective, it primarily focuses on generating responses and lacks mechanisms to explicitly foster critique or reflection. Several recent studies, like Critique-Fine-Tuning (CFT) and Critique-Guided-Distillation (CGD), have shown the benefits of explicitly teaching LLMs how to critique. Motivated by them, we propose Critique Reinforcement Learning (CRL), where the model is tasked with generating a critique for a given (question, solution) pair. The reward is determined solely by whether the final judgment label $c \in \{\texttt{True}, \texttt{False}\}$ of the generated critique aligns with the ground-truth judgment $c^*$. Building on this point, we introduce Critique-Coder, which is trained on a hybrid of RL and CRL by substituting 20% of the standard RL data with CRL data. We fine-tune multiple models (Critique-Coder) and evaluate them on different benchmarks to show their advantages over RL-only models. We show that Critique-Coder consistently outperforms RL-only baselines on all the evaluated benchmarks. Notably, our Critique-Coder-8B can reach over 60% on LiveCodeBench (v5), outperforming other reasoning models like DeepCoder-14B and GPT-o1. Beyond code generation, Critique-Coder also demonstrates enhanced general reasoning abilities, as evidenced by its better performance on logic reasoning tasks from the BBEH dataset. This indicates that the application of CRL on coding datasets enhances general reasoning and critique abilities, which are transferable across a broad range of tasks. Hence, we believe that CRL works as a great complement to standard RL for LLM reasoning.
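Abstract (reference translation): Reinforcement Learning (RL) has become a common training paradigm, especially in combination with reasoning models. Although effective, it concentrates on producing responses and has no mechanism that explicitly encourages critique or reflection. Recent work such as Critique-Fine-Tuning (CFT) and Critique-Guided-Distillation (CGD) has shown the benefit of explicitly teaching LLMs how to critique. Motivated by this work, the authors propose Critique Reinforcement Learning (CRL), in which the model generates a critique for a given (question, solution) pair. The reward depends only on whether the final judgment label $c \in \{\texttt{True}, \texttt{False}\}$ of the generated critique matches the ground-truth judgment $c^*$. On this basis they introduce Critique-Coder, trained on a hybrid of RL and CRL in which 20% of the standard RL data is replaced with CRL data. Several models (Critique-Coder) are fine-tuned and evaluated on different benchmarks to show their advantages over RL-only models. Critique-Coder consistently outperforms the RL-only baselines on every benchmark evaluated. Critique-Coder-8B exceeds 60% on LiveCodeBench (v5), outperforming other reasoning models such as DeepCoder-14B and GPT-o1. Beyond code generation, Critique-Coder also shows improved general reasoning ability, as evidenced by better performance on logic reasoning tasks from the BBEH dataset. This indicates that applying CRL to coding datasets improves general reasoning and critique abilities that transfer across a broad range of tasks. The authors therefore regard CRL as a strong complement to standard RL for LLM reasoning.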
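The two mechanisms the abstract describes, a binary reward on the critique's final judgment label and a 20% substitution of RL data with CRL data, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the "Judgment: True/False" output convention, the function names, the rule that an unparseable critique earns zero reward, and the random sampling used for the substitution are all assumptions made for the example.

```python
import random
import re
from typing import Optional

# Assumption: the critique states its verdict as "Judgment: True" or
# "Judgment: False". The abstract does not specify an output format.
_LABEL = re.compile(r"judgment:\s*(true|false)", re.IGNORECASE)


def parse_judgment(critique: str) -> Optional[bool]:
    """Return the last judgment label found in the critique, or None."""
    matches = _LABEL.findall(critique)
    if not matches:
        return None
    return matches[-1].lower() == "true"


def crl_reward(critique: str, ground_truth: bool) -> float:
    """Binary reward: 1.0 if the final label c equals c*, else 0.0.

    Assumption: a critique with no parseable label is treated as incorrect.
    """
    label = parse_judgment(critique)
    return 1.0 if label is not None and label == ground_truth else 0.0


def mix_datasets(rl_data: list, crl_data: list,
                 crl_fraction: float = 0.2, seed: int = 0) -> list:
    """Replace crl_fraction of the RL examples with CRL examples.

    The total size stays equal to len(rl_data). Which examples are dropped
    or added is chosen at random here, which is an assumption.
    """
    rng = random.Random(seed)
    n_crl = round(len(rl_data) * crl_fraction)
    if n_crl > len(crl_data):
        raise ValueError("not enough CRL examples for the requested fraction")
    kept = rng.sample(rl_data, len(rl_data) - n_crl)
    mixed = kept + rng.sample(crl_data, n_crl)
    rng.shuffle(mixed)
    return mixed
```

Because the reward looks only at the final label, the text of the critique itself is never scored directly; any improvement in critique quality would have to come from its effect on getting the verdict right.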


Related papers