Scaling Agentic Verifier for Competitive Coding
- URL: http://arxiv.org/abs/2602.04254v1
- Date: Wed, 04 Feb 2026 06:30:40 GMT
- Title: Scaling Agentic Verifier for Competitive Coding
- Authors: Zeyao Ma, Jing Zhang, Xiaokang Zhang, Jiaxi Yang, Zongmeng Zhang, Jiajun Zhang, Yuheng Jing, Lei Zhang, Hao Zheng, Wenting Zhao, Junyang Lin, Binyuan Hui
- Abstract summary: Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling. We propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs.
- Score: 66.11758166379092
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling. To address this limitation, we propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs that expose behavioral discrepancies among candidate solutions. Through multi-turn interaction with code execution environments, the verifier iteratively refines the candidate input generator and produces targeted counterexamples rather than blindly sampling inputs. We train the verifier to acquire this discriminative input generation capability via a scalable pipeline combining large-scale data synthesis, rejection fine-tuning, and agentic reinforcement learning. Extensive experiments across five competitive programming benchmarks demonstrate consistent improvements over strong execution-based baselines, achieving up to +10-15% absolute gains in Best@K accuracy. Further analysis reveals clear test-time scaling behavior and highlights the verifier's broader potential beyond reranking.
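The abstract's core idea, selecting the candidate whose behavior is shared by the largest cluster of solutions on discriminative inputs, can be sketched as a toy reranker. This is a hypothetical simplification: the paper's Agentic Verifier uses an LLM agent that iteratively refines an input generator, which is replaced here by plain random search; all names (`rerank_by_execution`, `gen_input`, `run`) are illustrative, not from the paper.

```python
import random
from collections import Counter

def run(candidate, x):
    """Execute a candidate; a crash counts as a distinct 'ERROR' output."""
    try:
        return candidate(x)
    except Exception:
        return "ERROR"

def rerank_by_execution(candidates, gen_input, rounds=50, seed=0):
    """Cluster candidate programs by their output signatures on
    discriminative inputs, then return a candidate from the largest
    (plurality) behavioral cluster."""
    rng = random.Random(seed)
    inputs = []
    for _ in range(rounds):
        x = gen_input(rng)
        outs = [run(c, x) for c in candidates]
        # Keep only *discriminative* inputs: those on which at least
        # two candidates disagree and so expose behavioral discrepancies.
        if len(set(outs)) > 1:
            inputs.append(x)
    # Signature = tuple of outputs across the discriminative inputs.
    sigs = [tuple(run(c, x) for x in inputs) for c in candidates]
    best_sig, _ = Counter(sigs).most_common(1)[0]
    return candidates[sigs.index(best_sig)]

# Usage: three candidates for "sum of 1..n"; two are correct, one buggy.
cands = [
    lambda n: n * (n + 1) // 2,       # closed form (correct)
    lambda n: sum(range(1, n + 1)),   # loop (correct)
    lambda n: n * n // 2,             # buggy
]
best = rerank_by_execution(cands, lambda rng: rng.randint(1, 100))
print(best(10))  # 55
```

The two correct candidates agree on every input and form the plurality cluster, so the buggy one is rejected without any reference tests, which is the property execution-based reranking exploits.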
Related papers
- $V_1$: Unifying Generation and Self-Verification for Parallel Reasoners [69.66089681814013]
$V_1$ is a framework that unifies generation and verification through efficient pairwise ranking. $V_1$-Infer improves Pass@1 by up to $10\%$ over pointwise verification. $V_1$-PairRL achieves $7$--$9\%$ test-time scaling gains over standard RL and pointwise joint training.
arXiv Detail & Related papers (2026-03-04T17:22:16Z) - MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning [19.054149750597933]
MIST-RL (Mutation-based Incremental Suite Testing via Reinforcement Learning) is a framework that shifts the focus to "scaling-by-utility". We introduce a novel incremental mutation reward combined with dynamic penalties, which incentivizes the model to discover new faults while suppressing functionally equivalent assertions. Experiments on HumanEval+ and MBPP+ demonstrate that MIST-RL outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-02T03:22:44Z) - CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions [8.163435280190027]
Existing benchmarks often lack coverage for subtle corner cases, allowing incorrect solutions to pass. CodeHacker generates adversarial test cases that expose latent vulnerabilities in program submissions. Experiments demonstrate that CodeHacker significantly improves the True Negative Rate (TNR) of existing datasets.
arXiv Detail & Related papers (2026-02-23T05:59:30Z) - CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation. Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z) - BOSQTGEN: Breaking the Sound Barrier in Test Generation [3.052470294814771]
We introduce BOSQTGEN, a novel black-box tool for API test generation. BOSQTGEN utilizes a novel approach for decomposing API specifications into primitives, using LLMs to suggest coherent interactions for them, and employing testing to efficiently sample over these values. The resulting BOSQTGEN system achieves an average of 82% of critical code coverage on benchmarks, often a 20% or more increase over prior state-of-the-art systems.
arXiv Detail & Related papers (2025-10-22T17:11:30Z) - Budget-aware Test-time Scaling via Discriminative Verification [29.169164125933538]
Test-time scaling is a powerful strategy for boosting the performance of large language models on complex reasoning tasks. In this work, we shift the focus to a more budget-aware paradigm: discriminative verification. Under a fixed compute budget, this hybrid approach surpasses state-of-the-art generative verification by a significant margin.
arXiv Detail & Related papers (2025-10-16T17:30:02Z) - Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking [54.43083499412643]
Test-time algorithms that combine the generative power of language models with process verifiers offer a promising lever for eliciting new reasoning capabilities. We introduce a new process-guided test-time sampling algorithm, VGB, which uses theoretically grounded backtracking to achieve provably better robustness to verifier errors.
arXiv Detail & Related papers (2025-10-03T16:21:14Z) - Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework. GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution. We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
arXiv Detail & Related papers (2025-08-27T06:51:48Z) - KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding [49.56049319037421]
KodCode is a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data. It comprises question-solution-test triplets that are systematically validated via a self-verification procedure. This pipeline yields a large-scale, robust, and diverse coding dataset.
arXiv Detail & Related papers (2025-03-04T19:17:36Z) - Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation [27.484259938667776]
Large Language Models excel at code generation yet struggle with complex programming tasks that demand reasoning. We introduce Outcome Refining Process Supervision, which unifies process and outcome supervision by leveraging executable verification. Experiments across 5 models and 3 benchmarks show consistent gains, with 26.9% higher correctness and 42.2% improved code efficiency.
arXiv Detail & Related papers (2024-12-19T17:59:42Z) - On Speeding Up Language Model Evaluation [48.51924035873411]
We propose an $\textit{adaptive}$ approach to explore this space. We lean on multi-armed bandits to sequentially identify the next (method, validation sample)-pair to evaluate. We show that it can identify the top-performing method using only 5-15% of the typical resources.
arXiv Detail & Related papers (2024-07-08T17:48:42Z) - Generating and Detecting True Ambiguity: A Forgotten Danger in DNN
Supervision Testing [8.210473195536077]
We propose a novel way to generate ambiguous inputs to test Deep Neural Networks (DNNs).
In particular, we propose AmbiGuess to generate ambiguous samples for image classification problems.
We find that the detectors best suited to detecting true ambiguity perform worse on invalid, out-of-distribution, and adversarial inputs, and vice versa.
arXiv Detail & Related papers (2022-07-21T14:21:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.