Fugu-MT Paper Translation (Abstract): RefineRL: Advancing Competitive Programming with Self-Refinement Reinforcement Learning

Paper overview: RefineRL: Advancing Competitive Programming with Self-Refinement Reinforcement Learning

arxiv url: http://arxiv.org/abs/2604.00790v1
Date: Wed, 01 Apr 2026 11:54:57 GMT
Status: Translation complete
Last updated in system: 2026-04-02 16:44:31.96803
Title: RefineRL: Advancing Competitive Programming with Self-Refinement Reinforcement Learning
Title (reference translation): RefineRL: Advancing Competitive Programming with Self-Refinement Reinforcement Learning
Authors: Shaopeng Fu, Xingxing Zhang, Li Dong, Di Wang, Furu Wei
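Abstract summary: RefineRL is a new approach designed to unleash the self-refinement capabilities of large language models on competitive programming (CP) problems. Skeptical-Agent is an iterative self-refinement agent equipped with local execution tools that validate generated solutions against the public test cases of CP problems. The reinforcement learning solution incentivizes LLMs to self-refine using only standard RLVR data.
Reference score (independently computed attention score): 63.432969627395686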
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While large language models (LLMs) have demonstrated strong performance on complex reasoning tasks such as competitive programming (CP), existing methods predominantly focus on single-attempt settings, overlooking their capacity for iterative refinement. In this paper, we present RefineRL, a novel approach designed to unleash the self-refinement capabilities of LLMs for CP problem solving. RefineRL introduces two key innovations: (1) Skeptical-Agent, an iterative self-refinement agent equipped with local execution tools to validate generated solutions against public test cases of CP problems. This agent always maintains a skeptical attitude towards its own outputs and thereby enforces rigorous self-refinement even when validation suggests correctness. (2) A reinforcement learning (RL) solution to incentivize LLMs to self-refine with only standard RLVR data (i.e., problems paired with their verifiable answers). Extensive experiments on Qwen3-4B and Qwen3-4B-2507 demonstrate that our method yields substantial gains: after our RL training, these compact 4B models integrated with the Skeptical-Agent not only outperform much larger 32B models but also approach the single-attempt performance of 235B models. These findings suggest that self-refinement holds considerable promise for scaling LLM reasoning, with significant potential for further advancement.
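Abstract (reference translation): Large language models (LLMs) have shown strong performance on complex reasoning tasks such as competitive programming (CP), but existing methods focus mainly on single-attempt settings and overlook the models' capacity for iterative refinement. This paper proposes RefineRL, a new method for unleashing the self-refinement capabilities of LLMs for CP problem solving. RefineRL has two key components: (1) Skeptical-Agent, an iterative self-refinement agent equipped with local execution tools that validate generated solutions against the public test cases of CP problems. The agent always remains skeptical of its own outputs and enforces rigorous self-refinement even when validation suggests the solution is correct. (2) A reinforcement learning (RL) solution that incentivizes LLMs to self-refine using only standard RLVR data (that is, problems paired with their verifiable answers). Extensive experiments on Qwen3-4B and Qwen3-4B-2507 show that, after RL training, these compact 4B models combined with the Skeptical-Agent not only outperform much larger 32B models but also approach the single-attempt performance of 235B models. These results suggest that self-refinement holds considerable promise for scaling LLM reasoning, with room for further advancement.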
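
The iterative loop that the abstract attributes to the Skeptical-Agent can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `run_public_tests`, `skeptical_refine`, the `generate` callable (a stand-in for the LLM), the feedback wording, and the fixed `max_rounds` budget are all assumptions made for illustration. The sketch only shows the two ideas stated in the abstract: candidate solutions are executed locally against public test cases, and refinement continues even when those tests pass.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Tuple

TestCase = Tuple[str, str]  # (stdin text, expected stdout text)
# Hypothetical LLM interface: (problem, previous_code, feedback) -> new code.
Generator = Callable[[str, Optional[str], Optional[str]], str]


def run_public_tests(code: str, tests: List[TestCase], timeout: float = 5.0) -> List[str]:
    """Run `code` as a Python script on each public test; return failure messages."""
    failures: List[str] = []
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "solution.py"
        path.write_text(code)
        for stdin_text, expected in tests:
            try:
                proc = subprocess.run(
                    [sys.executable, str(path)],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                failures.append(f"timeout on input {stdin_text!r}")
                continue
            if proc.returncode != 0:
                failures.append(f"runtime error on input {stdin_text!r}: {proc.stderr.strip()}")
            elif proc.stdout.strip() != expected.strip():
                failures.append(
                    f"input {stdin_text!r}: expected {expected.strip()!r}, "
                    f"got {proc.stdout.strip()!r}"
                )
    return failures


def skeptical_refine(
    problem: str,
    tests: List[TestCase],
    generate: Generator,
    max_rounds: int = 3,
) -> str:
    """Refine a solution for a fixed number of rounds (assumed budget).

    Passing the public tests does not end the loop: the model is still asked
    to re-examine its solution, mirroring the "skeptical" stance in the abstract.
    """
    code = generate(problem, None, None)
    for _ in range(max_rounds):
        failures = run_public_tests(code, tests)
        if failures:
            feedback = "Public tests failed:\n" + "\n".join(failures)
        else:
            # Assumed wording: public tests may not cover hidden cases.
            feedback = (
                "Public tests passed, but they may miss hidden cases. "
                "Re-check edge cases and time complexity."
            )
        code = generate(problem, code, feedback)
    return code
```

In use, `generate` would wrap a call to the model being evaluated, and `tests` would hold the public examples from the problem statement. A practical version would likely also sandbox execution and decide which candidate to submit; the abstract does not specify either, so both are left out here.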

List of related papers