Fugu-MT Paper Translation (Summary): Breaking Training Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models

Paper summary: Breaking Training Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models

arxiv url: http://arxiv.org/abs/2603.07777v1
Date: Sun, 08 Mar 2026 19:40:12 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:15.210016
Title: Breaking Training Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models
Title (reference translation): Breaking Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models
Authors: Zongqian Li, Shaohan Huang, Zewen Chi, Yixuan Su, Lexin Zhou, Li Dong, Nigel Collier, Furu Wei
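Abstract summary: Modern code generation models exhibit longer outputs, accelerated capability growth, and changed training dynamics. We propose MicroCoder-GRPO, an improved Group Relative Policy Optimization approach. MicroCoder-GRPO achieves up to 17.6% relative improvement over strong baselines on LiveCodeBench v6. We release MicroCoder-Dataset, a more challenging training corpus that achieves 3x larger performance gains than mainstream datasets on LiveCodeBench v6 within 300 training steps.
Reference score (attention score computed independently by this site): 104.26904744478884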
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern code generation models exhibit longer outputs, accelerated capability growth, and changed training dynamics, rendering traditional training methodologies, algorithms, and datasets ineffective for improving their performance. To address these training bottlenecks, we propose MicroCoder-GRPO, an improved Group Relative Policy Optimization approach with three innovations: conditional truncation masking to improve long output potential while maintaining training stability, diversity-determined temperature selection to maintain and encourage output diversity, and removal of KL loss with high clipping ratios to facilitate solution diversity. MicroCoder-GRPO achieves up to 17.6% relative improvement over strong baselines on LiveCodeBench v6, with more pronounced gains under extended context evaluation. Additionally, we release MicroCoder-Dataset, a more challenging training corpus that achieves 3x larger performance gains than mainstream datasets on LiveCodeBench v6 within 300 training steps, and MicroCoder-Evaluator, a robust framework with approximately 25% improved evaluation accuracy and around 40% faster execution. Through comprehensive analysis across more than thirty controlled experiments, we reveal 34 training insights across seven main aspects, demonstrating that properly trained models can achieve competitive performance with larger counterparts.
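Abstract (reference translation): Modern code generation models exhibit longer outputs, accelerated capability growth, and changed training dynamics, which renders traditional training methodologies, algorithms, and datasets ineffective. To address these training bottlenecks, we propose MicroCoder-GRPO, a Group Relative Policy Optimization approach with three innovations: conditional truncation masking to improve long output potential while maintaining training stability, diversity-determined temperature selection to maintain and encourage output diversity, and removal of the KL loss combined with high clipping ratios. MicroCoder-GRPO achieves up to 17.6% relative improvement over strong baselines on LiveCodeBench v6. In addition, we release MicroCoder-Dataset, a more challenging training corpus that achieves 3x larger performance gains than mainstream datasets on LiveCodeBench v6 within 300 training steps, and MicroCoder-Evaluator, a robust framework with approximately 25% improved evaluation accuracy and around 40% faster execution. Through comprehensive analysis of more than thirty controlled experiments, we reveal 34 training insights across seven main aspects and demonstrate that properly trained models can achieve performance competitive with larger models.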
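As a rough illustration of the ingredients the abstract names, the sketch below shows a generic group-relative advantage, a clipped surrogate with a wider upper clip and no KL term, and a masked average that skips excluded samples. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the clip values (0.2 and 0.28), and the idea that the mask is supplied by the caller are all hypothetical, since the abstract does not state the condition used for truncation masking or the actual clipping ratios.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_low=0.2, clip_high=0.28):
    """PPO-style clipped objective for one token, with no KL penalty added.

    clip_low and clip_high are illustrative values, not taken from the paper.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)


def masked_mean(values, mask):
    """Average only the entries whose mask is truthy (e.g. non-masked samples).

    Which truncated samples get masked is an assumption left to the caller.
    """
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


# Example: four sampled solutions for one prompt, two of which pass the tests.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
objective = masked_mean(
    [clipped_surrogate(0.0, 0.0, a) for a in advantages],
    mask=[True, True, False, True],  # hypothetical: third sample excluded
)
```

Because the advantage is computed relative to the group, no separate value network is needed, and a higher upper clip lets low-probability tokens with positive advantage move further in a single update.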


Related papers