Fugu-MT Paper Translation (Abstract): ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?

Paper abstract: ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?

arxiv url: http://arxiv.org/abs/2604.07864v1
Date: Thu, 09 Apr 2026 06:24:54 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:05.737946
Title: ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?
Authors: Lishui Fan, Mouxiang Chen, Tingwei Zhu, Kui Liu, Xin Xia, Shanping Li, Zhongxin Liu
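Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for improving code generation through execution-based feedback. Existing work tried to use self-generated tests to ground rewards, but the lack of discriminative tests limits the effect, because the model performs sub-optimally at test generation. The paper presents ZeroCoder, a fully label-free co-evolutionary framework that jointly trains a Coder and a Tester using execution feedback from self-generated code-test interactions.
Reference score (attention score computed independently by the site): 13.984583399745157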
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Code generation is important in software engineering, and Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm to improve it through execution-based feedback. However, most RLVR pipelines rely on human-curated tests, making progress bottlenecked by scarce and costly supervision. Existing work tried to use self-generated tests to ground rewards, but the lack of discriminative tests constrains the effect due to the sub-optimal performance of the model on test generation. We aim to improve code generation without ground-truth supervision by co-evolving code and test generation, so that their interactions yield progressively more informative supervision. To this end, we present ZeroCoder, a fully label-free co-evolutionary framework that jointly trains a Coder and a Tester using execution feedback from self-generated code-test interactions. For each problem, ZeroCoder executes sampled solutions against sampled tests to form a passing matrix, identifies a consensus subset of likely-correct solutions and consistent tests via a pluggable selection algorithm, and derives role-specific rewards. To ensure reward quality, ZeroCoder filters low-information instances via rank-based pre-filtering and trains the Tester with a curriculum balancing validity and mutation-driven discriminativeness. We further identify selector drift, the progressive miscalibration of fixed selection rules during co-evolution, and introduce DyB4, a Bayesian selector that uses as few as 10 labeled instances to recalibrate its priors dynamically. Across three models and six benchmarks, ZeroCoder consistently improves code generation and test generation. In the fully label-free setting, it improves code generation by up to 14.5% over the base model on Qwen2.5-Coder-7B-Instruct. With DyB4, the gain reaches 21.6%, while test generation improves by 24.3%, approaching oracle-supervised performance.
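The passing-matrix step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says the selector is pluggable, so the agreement heuristic in `select_consensus` (largest group of solutions with an identical pass pattern, weighted by the number of tests passed), the binary rewards in `role_rewards`, and the toy "square a number" problem are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Solution = Callable[[int], int]
Test = Callable[[Solution], bool]


def passing_matrix(solutions: Sequence[Solution],
                   tests: Sequence[Test]) -> List[Tuple[bool, ...]]:
    """Row i, column j is True when solution i passes test j."""
    matrix = []
    for sol in solutions:
        row = []
        for test in tests:
            try:
                row.append(bool(test(sol)))
            except Exception:
                # A crashing solution or test counts as a failure.
                row.append(False)
        matrix.append(tuple(row))
    return matrix


def select_consensus(matrix: List[Tuple[bool, ...]]) -> Tuple[List[int], List[int]]:
    """Hypothetical selector (an assumption, not the paper's algorithm).

    Groups solutions by pass pattern and picks the pattern that maximizes
    (number of solutions sharing it) * (number of tests it passes).
    Returns the indices of the chosen solutions and of the tests they pass.
    """
    if not matrix:
        return [], []
    counts = Counter(matrix)
    best = max(counts, key=lambda pattern: counts[pattern] * sum(pattern))
    solution_ids = [i for i, row in enumerate(matrix) if row == best]
    test_ids = [j for j, passed in enumerate(best) if passed]
    return solution_ids, test_ids


def role_rewards(matrix: List[Tuple[bool, ...]],
                 solution_ids: List[int],
                 test_ids: List[int]) -> Tuple[List[float], List[float]]:
    """Assumed binary rewards: 1.0 for members of the consensus subset."""
    n_tests = len(matrix[0]) if matrix else 0
    coder = [1.0 if i in solution_ids else 0.0 for i in range(len(matrix))]
    tester = [1.0 if j in test_ids else 0.0 for j in range(n_tests)]
    return coder, tester


# Toy problem (illustrative only): "return the square of x".
solutions: List[Solution] = [
    lambda x: x * x,   # correct
    lambda x: x ** 2,  # correct
    lambda x: x + x,   # wrong, but agrees with the square on 0 and 2
    lambda x: x + 1,   # wrong
]
tests: List[Test] = [
    lambda f: f(2) == 4,
    lambda f: f(3) == 9,
    lambda f: f(0) == 0,
    lambda f: f(1) == 2,  # an invalid self-generated test
]

matrix = passing_matrix(solutions, tests)
chosen_solutions, chosen_tests = select_consensus(matrix)
coder_rewards, tester_rewards = role_rewards(matrix, chosen_solutions, chosen_tests)
print(chosen_solutions, chosen_tests)
print(coder_rewards, tester_rewards)
```

In this toy case the two correct solutions share a pass pattern and outvote the others, so the invalid test receives no reward. A fixed rule like this one is exactly the kind of selector the abstract says can drift out of calibration as the Coder and Tester change during training.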
