Fugu-MT Paper Translation (Abstract): DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation

Paper overview: DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation

arxiv url: http://arxiv.org/abs/2511.06307v1
Date: Sun, 09 Nov 2025 10:11:28 GMT
Status: Translation complete
Last updated in system: 2025-11-11 21:18:44.878753
Title: DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation
Title (reference translation): DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation
Authors: Speed Zhu, Jianwei Cai, Guang Chen, Lulu Wu, Saiyong Yang, Wiggin Zhou
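Abstract summary: We investigate how to construct RLVR datasets (i.e., RL prompts) and present training techniques that yield strong performance on competitive-programming code generation. The method is implemented on Qwen2.5-32B and evaluated on LeetCode and Codeforces weekly contests to avoid data leakage. The resulting model achieves state-of-the-art performance among models of similar scale and is comparable to leading systems such as DeepSeek v3.1 and Doubao-1.5-Thinking.
Reference score (independently computed attention score): 5.496363733566038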
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent reasoning-first models (e.g., OpenAI o1, DeepSeek R1) have spurred a resurgence of interest in RLVR. Nevertheless, advances are dominated by mathematics (e.g., AIME), with competitive-programming code generation underexplored and data curation receiving less attention than RL algorithm design. We investigate how to construct RLVR datasets (i.e., RL prompts) and present practical training techniques that yield strong performance on competitive-programming code generation. Our pipeline begins with supervised fine-tuning (SFT) distilled from strong open-source models, augmented with general-purpose and reasoning-intensive data. RL then follows a two-stage process with executable, testcase-driven rewards: first, training on a large, uniformly distributed set of competitive-programming problems using Group Relative Policy Optimization (GRPO) with 8 rollouts per prompt and a relatively short response-generation window (e.g., 32k during SFT and 24k in this stage) to expand entropy and mitigate repetition and truncation; second, we perform Pre-GRPO: updating on a small, high-quality set of challenging problems with a large rollout budget (64 rollouts per prompt) under a hard-focus curriculum that continuously retains the most difficult instances throughout training. We implement our method on Qwen2.5-32B and evaluate on LeetCode and Codeforces weekly contests to avoid data leakage. The resulting model achieves state-of-the-art performance among models of similar scale and is comparable to leading systems such as DeepSeek v3.1 and Doubao-1.5-Thinking. We also examine scaling trends and observe strong RL scaling on an internal large-scale MoE model. Our study distills concise best practices for data curation, entropy expansion, and curriculum design in RLVR for competitive-programming code generation.
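Abstract (reference translation): Recent reasoning-first models (OpenAI o1, DeepSeek R1) have revived interest in RLVR. Progress, however, is dominated by mathematics (e.g., AIME); competitive-programming code generation remains underexplored, and data curation receives less attention than RL algorithm design. This paper investigates how to construct RLVR datasets (RL prompts) and presents practical training techniques that yield strong performance on competitive-programming code generation. The pipeline begins with supervised fine-tuning (SFT) distilled from strong open-source models, augmented with general-purpose and reasoning-intensive data. RL then follows a two-stage process with executable, testcase-driven rewards. The first stage trains on a large, uniformly distributed set of competitive-programming problems using Group Relative Policy Optimization (GRPO) with 8 rollouts per prompt and a relatively short response-generation window (e.g., 32k during SFT and 24k in this stage) to expand entropy and mitigate repetition and truncation. The second stage, Pre-GRPO, updates on a small, high-quality set of challenging problems with a large rollout budget (64 rollouts per prompt) under a hard-focus curriculum that continuously retains the most difficult instances throughout training. The method is implemented on Qwen2.5-32B and evaluated on LeetCode and Codeforces weekly contests to avoid data leakage. The resulting model achieves state-of-the-art performance among models of similar scale and is comparable to leading systems such as DeepSeek v3.1 and Doubao-1.5-Thinking. The paper also examines scaling trends and observes strong RL scaling on an internal large-scale MoE model. The study distills concise best practices for data curation, entropy expansion, and curriculum design in RLVR for competitive-programming code generation.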


Related papers