Fugu-MT Paper Translation (Abstract): BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Paper overview: BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

arxiv url: http://arxiv.org/abs/2606.01286v1
Date: Sun, 31 May 2026 15:12:16 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:29.498199
Title: BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution
Authors: Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao, Ziheng Zhou, Mert Cemri, Shu Liu, Yuran Xiu, Chenxiao Yan, Haikun Zhao, Bin Yu, Ion Stoica, Dawn Song
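Abstract summary: BenchEvolver is a solution-centric evolutionary framework that transforms existing coding problems into harder variants. The paper shows that BenchEvolver can convert saturated benchmarks into frontier-level evaluation suites and reusable training signal.
Reference score (independently computed attention score): 59.619780973577434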
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets to differentiate model capabilities or provide useful training signal. For instance, on LiveCodeBench, frontier models achieve over 99% Pass@1 on easy splits and exceed 90% Pass@1 on average across difficulty levels. Constructing new, challenging datasets typically requires substantial human effort, creating a bottleneck for progress. We introduce BenchEvolver, a solution-centric evolutionary framework that automatically transforms existing coding problems into harder variants. Rather than generating problems from scratch, BenchEvolver evolves reference solutions through structured transformations and derives corresponding statements and tests from the evolved solutions. This design grounds generation in executable semantics, enabling scalable construction of high-quality, diverse, and difficult tasks with verifiable correctness. Applying BenchEvolver to LiveCodeBench and SciCode, we obtain evolved tasks that are substantially harder while maintaining validity, reference correctness, and diversity. We further curate LiveCodeBench-Plus, a 91-problem benchmark combining evolved and difficult original LCB-v6 tasks, where frontier-model Pass@1 ranges from 27.5% to 62.6%, restoring clear discrimination among strong coding models. Importantly, evolved tasks remain challenging even for the model that generates them, enabling self-improvement. We further show that RL on evolved LCB tasks improves held-out coding performance: for gpt-oss-20b, seed+evolved training achieves +8.7 and +8.3 Pass@1 gains on LCB v6 Hard and LCB-Pro Easy, exceeding seed-only gains by 70.7% and 34.8%, respectively. Our results show that BenchEvolver can convert saturated benchmarks into frontier-level evaluation suites and reusable training signal.
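The figures quoted in the abstract can be read with a small arithmetic sketch. It assumes the simplest single-sample form of Pass@1 (one attempt per problem) and assumes that "exceeding seed-only gains by 70.7% and 34.8%" is a relative comparison; under that reading the seed-only gains can be back-computed. The function names are hypothetical, and the back-computed values are derived here for illustration, not reported by the paper.

```python
def pass_at_1(results):
    """Pass@1 in percent, assuming one sampled solution per problem.

    `results` is a list of booleans: True if the single attempt for a
    problem passed all of its tests.
    """
    return 100.0 * sum(results) / len(results)


def implied_baseline_gain(gain, relative_excess_pct):
    """Baseline gain g such that `gain` exceeds g by `relative_excess_pct` percent.

    Assumes the relative reading: gain = g * (1 + relative_excess_pct / 100).
    """
    return gain / (1.0 + relative_excess_pct / 100.0)


# Toy example: 3 of 4 problems solved on the first attempt.
print(pass_at_1([True, False, True, True]))

# Seed+evolved gains from the abstract; the seed-only gains printed here
# are implied by the assumed relative reading, not quoted from the paper.
for split, gain, excess in [("LCB v6 Hard", 8.7, 70.7), ("LCB-Pro Easy", 8.3, 34.8)]:
    print(split, round(implied_baseline_gain(gain, excess), 1))
```

If the percentages were instead meant as absolute point differences, the back-computation above would not apply.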


Related papers