Fugu-MT Paper Translation (Abstract): ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning

Paper abstract: ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning

arxiv url: http://arxiv.org/abs/2603.05863v1
Date: Fri, 06 Mar 2026 03:38:17 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:44.984587
Title: ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning
Title (reference translation): ReflexiCoder: Teaching Large Language Models to Reflect on and Self-Correct Generated Code via Reinforcement Learning
Authors: Juyong Jiang, Jiasi Shen, Sunghun Kim, Kang Min Yoo, Jeonghoon Kim, Sungju Kim
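Abstract summary: Existing iterative refinement strategies rely on external oracles, execution feedback, or computationally expensive prompt-response cycles. The paper proposes ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory. The framework is significantly more token-efficient than base models, reducing inference-time compute overhead by approximately 40%.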
Reference score (independently computed attention score): 17.115542346570972
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Large Language Models (LLMs) have revolutionized code generation, standard "System 1" approaches, which generate solutions in a single forward pass, often hit a performance ceiling when faced with complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on external oracles, execution feedback, or computationally expensive prompt-response cycles. In this work, we propose ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory, encompassing initial generation, bug- and optimization-aware reflection, and self-correction, directly into the model's weights. Unlike prior methods, ReflexiCoder shifts the paradigm from externally dependent refinement to intrinsic, fully autonomous self-reflection and self-correction capabilities at inference time. We utilize an RL-zero training paradigm with granular reward functions to optimize the entire reflection-correction trajectory, teaching the model how to debug without reliance on ground-truth feedback or execution engines at inference time. Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder-8B establishes a new state of the art (SOTA) among leading open-source models in the 1.5B-14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces in a single-attempt setting, rivaling or surpassing proprietary models like GPT-5.1. Notably, our framework is significantly more token-efficient than base models, reducing inference-time compute overhead by approximately 40% through disciplined, high-speed reasoning and reflection patterns. Source code is available at https://github.com/juyongjiang/ReflexiCoder.
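Abstract (reference translation): Large Language Models (LLMs) have revolutionized code generation, but standard "System 1" approaches, which generate solutions in a single forward pass, often hit a performance ceiling on complex algorithmic tasks. Existing iterative refinement strategies try to close this gap at inference time, but they rely mainly on external oracles, execution feedback, or computationally expensive prompt-response cycles. This work proposes ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory. Unlike prior methods, ReflexiCoder shifts the paradigm from externally dependent refinement to intrinsic, fully autonomous self-reflection and self-correction at inference time. The method uses an RL-zero training paradigm to optimize the entire reflection-correction trajectory, teaching the model how to debug without relying on ground-truth feedback or execution engines at inference time. Extensive experiments across seven benchmarks show that ReflexiCoder-8B establishes a new state of the art (SOTA) among leading open-source models in the 1.5B-14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces in a single-attempt setting, rivaling or surpassing proprietary models such as GPT-5.1. Notably, the framework is significantly more token-efficient than base models, reducing inference-time compute overhead by approximately 40% through disciplined, high-speed reasoning and reflection patterns. Source code is available at https://github.com/juyongjiang/ReflexiCoder.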
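
Illustrative sketch (not from the paper): the abstract describes a trajectory of initial generation, reflection, and self-correction that is optimized with granular rewards during RL training. The Python below is only a toy model of that idea. The names `Trajectory` and `toy_reward`, the reward weights, and the token budget are all hypothetical and are not taken from the paper or its repository; it also assumes that pass/fail signals are available at training time only, since the abstract states that no execution engine is needed at inference time.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One hypothetical rollout: draft -> reflection -> corrected code."""
    initial_passed: bool        # did the first draft pass training-time checks?
    reflection_found_bug: bool  # did the reflection claim the draft has a bug?
    final_passed: bool          # did the corrected code pass training-time checks?
    num_tokens: int             # total tokens emitted across all three stages


def toy_reward(t: Trajectory, token_budget: int = 2048) -> float:
    """Made-up granular reward; the weights are assumptions, not the paper's."""
    # Outcome term: the final answer is what ultimately matters.
    reward = 1.0 if t.final_passed else 0.0
    # Reflection term: the reflection flags a bug exactly when the draft failed.
    if t.reflection_found_bug == (not t.initial_passed):
        reward += 0.25
    # Correction term: a failing draft was repaired.
    if not t.initial_passed and t.final_passed:
        reward += 0.25
    # Efficiency term: penalize trajectories that exceed the token budget.
    if t.num_tokens > token_budget:
        reward -= 0.25
    return reward
```

Under these assumed weights, a rollout that correctly diagnoses and repairs a failing draft within budget scores highest, while a rollout that misses its own bug and stays wrong scores zero.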


Related papers