BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution
Abstract Overview
BenchEvolver is a solution-centric evolutionary framework that converts existing coding tasks into harder variants by first mutating reference solutions and then deriving corresponding statements and tests. The method combines a proposer, benchmark-specific validation, empirical difficulty measurement against target models, and memory-guided search to ensure accepted tasks are both executable and genuinely more challenging. Across LiveCodeBench and SciCode, the paper reports high post-hoc validity for accepted tasks and consistent reductions in model pass rates, including for the same models that generated the tasks. The authors also use these evolved tasks to build LiveCodeBench-Plus, a 91-problem benchmark for frontier coding models, and to provide reinforcement-learning signal for model self-improvement.
Novelty
The paper’s main novelty is a solution-first task evolution pipeline that grounds synthesis in executable semantics rather than generating problem statements first. It also frames benchmark evolution as a closed-loop self-challenging process, where tasks are selected by empirical model failure and can later be reused as training signal for the same model family.
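The solution-first ordering can be illustrated with a toy sketch. Everything here is hypothetical and not taken from the paper: the seed task, the mutation (maximum element changed to maximum contiguous-subarray sum), and the helper names `derive_tests` and `passes_all`. The point is only that once a mutated reference solution exists, expected outputs can be obtained by executing it, so the tests are grounded in runnable code rather than in a freshly written statement.

```python
from typing import Callable, List, Tuple

Test = Tuple[List[int], int]


def seed_solution(nums: List[int]) -> int:
    """Hypothetical seed task: return the maximum element."""
    return max(nums)


def evolved_solution(nums: List[int]) -> int:
    """Hypothetical mutated reference: maximum contiguous-subarray sum."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)   # extend the current run or restart at x
        best = max(best, cur)
    return best


def derive_tests(solution: Callable[[List[int]], int],
                 inputs: List[List[int]]) -> List[Test]:
    """Derive expected outputs by executing the reference solution."""
    return [(list(inp), solution(list(inp))) for inp in inputs]


def passes_all(candidate: Callable[[List[int]], int], tests: List[Test]) -> bool:
    """Check a candidate against the derived tests."""
    return all(candidate(list(inp)) == expected for inp, expected in tests)


inputs = [[1, -2, 3, 4], [-5, -1, -3], [2, 2, -10, 5]]
tests = derive_tests(evolved_solution, inputs)
```

In this sketch the seed solution no longer passes the derived tests, which is the minimal property an evolved task needs before any difficulty measurement against real models.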
Results
On LiveCodeBench-Plus, frontier-model pass@1 ranges from 27.5% to 62.6%, indicating that the benchmark restores discrimination among models after saturation. On the source hard split, average pass@1 drops from 87.0% on seed tasks to 45.7% on evolved tasks, a 41.3-point decrease. The framework also produces harder validated tasks on SciCode and improves RL training outcomes: for gpt-oss-20b, training on seed plus evolved tasks yields +8.7 points on LCB v6 Hard and +8.3 points on LCB-Pro Easy, outperforming seed-only training. Evolved-task training also transfers to an independently evolved benchmark.
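For readers unfamiliar with the metric, the sketch below shows one common way to compute pass@k from sampled generations: the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples per task and c the number that pass all tests. The summary does not say how the paper computes pass@1, so this is an assumption, and the function names are illustrative. For k = 1 the estimator reduces to c / n.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if n - c < k:
        # Every size-k subset of the samples contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: List[Tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over tasks, given (n, c) pairs per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-task sample counts, not data from the paper.
example = [(10, 3), (4, 0), (5, 5)]
score = mean_pass_at_k(example, k=1)
```

A benchmark-level figure such as those quoted above would then be the mean of the per-task values, expressed as a percentage.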
Key Points
- BenchEvolver evolves reference solutions before writing statements and tests, then filters candidates through benchmark-specific validation, empirical difficulty checks, and memory-guided search.
- The method generates valid, substantially harder tasks across competitive-programming and scientific-coding benchmarks, and the resulting LiveCodeBench-Plus benchmark preserves meaningful separation among strong coding models.
- Evolved tasks are not only harder evaluation items; they also function as reusable RL training signal that improves held-out coding performance for the same model family that generated them.
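The filtering stages named in the key points can be sketched as a single acceptance check. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Candidate` and `Memory` types, the `min_drop` threshold, and the idea of logging rejection reasons for later proposals are all hypothetical stand-ins for the benchmark-specific validation, empirical difficulty check, and memory-guided search.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Candidate:
    name: str
    is_valid: bool     # stand-in for the benchmark-specific validation result
    pass_rate: float   # stand-in for the measured target-model pass rate


@dataclass
class Memory:
    accepted: List[str] = field(default_factory=list)
    rejected: List[str] = field(default_factory=list)


def accept(candidate: Candidate, seed_pass_rate: float,
           memory: Memory, min_drop: float = 0.1) -> bool:
    """Accept a candidate only if it is valid and measurably harder than its seed."""
    if not candidate.is_valid:
        memory.rejected.append(f"{candidate.name}: failed validation")
        return False
    if seed_pass_rate - candidate.pass_rate < min_drop:
        memory.rejected.append(f"{candidate.name}: not harder than seed")
        return False
    memory.accepted.append(candidate.name)
    return True


memory = Memory()
candidates = [
    Candidate("broken", is_valid=False, pass_rate=0.0),
    Candidate("too_easy", is_valid=True, pass_rate=0.85),
    Candidate("harder", is_valid=True, pass_rate=0.40),
]
decisions = [accept(c, seed_pass_rate=0.9, memory=memory) for c in candidates]
```

Keeping the rejection reasons in memory is one plausible way a later proposal step could avoid repeating failed mutations; how the actual system uses its memory is not specified in this summary.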