FuguReport

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Authors: Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao, Ziheng Zhou, Mert Cemri, Shu Liu, Yuran Xiu, Chenxiao Yan, Haikun Zhao, Bin Yu, Ion Stoica, Dawn Song
Affiliations: University of California, Berkeley / Tsinghua University
Categories: Method / Benchmarking / Synthesis of harder task variants; Application / Training Signal Design / Reusable signals from benchmark evolution; Evaluation / Model Evaluation / Frontier-level evaluation suite creation
License: CC BY 4.0

Abstract Overview

BenchEvolver is a solution-centric evolutionary framework that converts existing coding tasks into harder variants by first mutating reference solutions and then deriving corresponding statements and tests. The method combines a proposer, benchmark-specific validation, empirical difficulty measurement against target models, and memory-guided search to ensure accepted tasks are both executable and genuinely more challenging. Across LiveCodeBench and SciCode, the paper reports high post-hoc validity for accepted tasks and consistent reductions in model pass rates, including for the same models that generated the tasks. The authors also use these evolved tasks to build LiveCodeBench-Plus, a 91-problem benchmark for frontier coding models, and to provide reinforcement-learning signal for model self-improvement.
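The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Task`, `Memory`, `evolve`, and the four callables (`mutate_solution`, `derive_task`, `is_valid`, `pass_rate`) are hypothetical stand-ins for the proposer, the statement and test derivation step, the benchmark-specific validator, and the empirical difficulty measurement.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Task:
    solution: str                      # reference solution source
    statement: str = ""                # statement derived from the solution
    tests: List[str] = field(default_factory=list)


@dataclass
class Memory:
    """Hypothetical memory: outcomes of earlier attempts, visible to the proposer."""
    accepted: List[Task] = field(default_factory=list)
    rejected: List[str] = field(default_factory=list)  # short rejection reasons


def evolve(
    seed: Task,
    mutate_solution: Callable[[Task, Memory], str],
    derive_task: Callable[[str], Task],
    is_valid: Callable[[Task], bool],
    pass_rate: Callable[[Task], float],
    max_rounds: int = 5,
) -> Optional[Task]:
    """Return the first valid candidate that is empirically harder than the seed."""
    memory = Memory()
    seed_rate = pass_rate(seed)
    for _ in range(max_rounds):
        # 1. Mutate the reference solution first (solution-centric step).
        new_solution = mutate_solution(seed, memory)
        # 2. Derive the statement and tests from the mutated solution.
        candidate = derive_task(new_solution)
        # 3. Benchmark-specific validation (e.g. the solution passes its own tests).
        if not is_valid(candidate):
            memory.rejected.append("invalid")
            continue
        # 4. Empirical difficulty: keep only candidates the target model solves less often.
        if pass_rate(candidate) >= seed_rate:
            memory.rejected.append("not harder")
            continue
        memory.accepted.append(candidate)
        return candidate
    return None
```

In this sketch the memory only records rejection reasons; how the actual system stores and uses search history is not specified in this summary.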

Novelty

The paper’s main novelty is a solution-first task evolution pipeline that grounds synthesis in executable semantics rather than generating problem statements first. It also frames benchmark evolution as a closed-loop self-challenging process, where tasks are selected by empirical model failure and can later be reused as training signal for the same model family.
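Selecting tasks by empirical model failure requires an estimate of how often a model solves each task. A minimal sketch, assuming the widely used unbiased pass@k estimator computed from n sampled attempts with c correct; whether the paper uses exactly this estimator, and the `is_harder` rule with its `margin` parameter, are assumptions made here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Unbiased pass@k estimate from n samples, of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def is_harder(seed_n: int, seed_c: int, evo_n: int, evo_c: int,
              margin: float = 0.0, k: int = 1) -> bool:
    """Hypothetical acceptance rule: the evolved task must lower pass@k by more than margin."""
    return pass_at_k(evo_n, evo_c, k) + margin < pass_at_k(seed_n, seed_c, k)


# Example: the model solves the seed 9/10 times but the evolved task only 3/10 times.
print(is_harder(10, 9, 10, 3, margin=0.2))  # True
```

For k = 1 the estimator reduces to the plain success fraction c / n.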

Results

On LiveCodeBench-Plus, frontier-model pass@1 spans 27.5% to 62.6%, indicating that the benchmark restores discrimination among models after saturation. On the source hard split, average pass@1 drops from 87.0% on seed tasks to 45.7% on evolved tasks. The framework also produces harder validated tasks on SciCode and improves RL training outcomes: for gpt-oss-20b, seed+evolved training yields +8.7 points on LCB v6 Hard and +8.3 points on LCB-Pro Easy, outperforming seed-only training, and evolved-task training transfers to an independently evolved benchmark.
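To put the reported figures in perspective, the short calculation below derives the absolute and relative drop in average pass@1 and the spread across frontier models. Only the four input numbers come from the text above; the variable names are illustrative.

```python
# Reported figures (percent).
seed_pass, evolved_pass = 87.0, 45.7      # average pass@1 on the source hard split
lowest, highest = 27.5, 62.6              # frontier-model pass@1 range on LiveCodeBench-Plus

abs_drop = round(seed_pass - evolved_pass, 1)        # percentage points
rel_drop = round(100 * abs_drop / seed_pass, 1)      # percent of the seed pass rate
spread = round(highest - lowest, 1)                  # percentage points

print(abs_drop)  # 41.3
print(rel_drop)  # 47.5
print(spread)    # 35.1
```

In other words, evolved tasks roughly halve the average pass rate relative to their seeds, and the strongest and weakest reported frontier models are separated by about 35 points.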

Key Points

  1. BenchEvolver evolves reference solutions before writing statements and tests, then filters candidates through benchmark-specific validation, empirical difficulty checks, and memory-guided search.
  2. The method generates valid, substantially harder tasks across competitive-programming and scientific-coding benchmarks, and the resulting LiveCodeBench-Plus benchmark preserves meaningful separation among strong coding models.
  3. Evolved tasks are not only harder evaluation items; they also function as reusable RL training signal that improves held-out coding performance for the same model family that generated them.

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.