Fugu-MT Paper Translation (Abstract): CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test

Paper overview: CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test

arxiv url: http://arxiv.org/abs/2605.23491v2
Date: Mon, 25 May 2026 03:01:24 GMT
Status: Translation complete
Last updated in system: 2026-05-26 16:32:38.058906
Title: CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test
Authors: Zhangyi Hu, Chenhui Liu, Tian Huang, Jindong Li, Yang Yang, Jiemin Wu, Zining Zhong, Menglin Yang, Yutao Yue
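Abstract summary: CoSPlay is a GT-free, training-free framework that jointly improves codes and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs.
Reference score (attention metric computed independently by the site): 11.070705548910636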
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have advanced LLM code generation through executable verification. Yet Ground-Truth Unit Tests (GT UTs) remain a bottleneck: SOTA RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT-free TTS, where existing methods directly use self-generated UTs to refine and select code candidates. Yet such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to jointly improve both. To this end, we present CoSPlay, a GT-free, training-free framework that jointly improves codes and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5-7B-Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE-7B. When applied to CURE-7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT-free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.
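
The two signals named in the abstract, bidirectional pass counts from the Code-UT execution matrix and the output-consensus tie-break, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy matrix, and the rule of flagging the lowest-scoring code and UT are assumptions made for illustration only, and the abstract does not specify the actual pruning or refresh criteria.

```python
from collections import defaultdict


def pass_counts(matrix):
    """Return (code_scores, ut_scores) for a boolean execution matrix.

    matrix[i][j] is True when code candidate i passes unit test j.
    Row sums score the codes; column sums score the unit tests.
    """
    code_scores = [sum(row) for row in matrix]
    ut_scores = [sum(col) for col in zip(*matrix)]
    return code_scores, ut_scores


def lowest_index(scores):
    """Index of the lowest score (hypothetical 'weakest' heuristic)."""
    return min(range(len(scores)), key=scores.__getitem__)


def select_by_consensus(code_scores, outputs):
    """Pick a final code among those tied at the highest pass count.

    outputs[i] is a tuple of code i's outputs on a shared set of inputs.
    Tied codes are grouped by identical outputs, and the first member of
    the largest group is returned.
    """
    best = max(code_scores)
    tied = [i for i, score in enumerate(code_scores) if score == best]
    clusters = defaultdict(list)
    for i in tied:
        clusters[outputs[i]].append(i)
    largest = max(clusters.values(), key=len)
    return largest[0]


# Toy data (assumed): 4 code candidates, 3 self-generated unit tests.
matrix = [
    [True, True, False],   # code 0
    [True, True, False],   # code 1
    [True, True, False],   # code 2
    [False, False, True],  # code 3 passes only UT 2
]
# Outputs of each code on three shared inputs (assumed values).
outputs = [(1, 4, 9), (1, 4, 9), (1, 4, 8), (0, 0, 0)]

code_scores, ut_scores = pass_counts(matrix)
print("code pass counts:", code_scores)
print("UT pass counts:", ut_scores)
print("weakest code:", lowest_index(code_scores))
print("least-passed UT:", lowest_index(ut_scores))
print("selected code:", select_by_consensus(code_scores, outputs))
```

In this toy matrix, UT 2 is passed only by the lowest-scoring code, which is the kind of UT spuriously coupled with wrong code that the abstract describes. Codes 0, 1, and 2 tie at the highest pass count, and the tie is resolved in favor of the two codes whose outputs agree.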

List of related papers