Reward Models Enable Scalable Code Verification by Trading Accuracy for Throughput
- URL: http://arxiv.org/abs/2506.10056v1
- Date: Wed, 11 Jun 2025 17:58:21 GMT
- Title: Reward Models Enable Scalable Code Verification by Trading Accuracy for Throughput
- Authors: Gabriel Orlanski, Nicholas Roberts, Aws Albarghouthi, Frederic Sala
- Abstract summary: We show that an outcome reward model (ORM) plays a crucial role in scaling verification by trading accuracy for speed. We analyze the generate-prune-then-rank approach and show that it works by filtering out incorrect but highly ranked solutions.
- Score: 21.59519440154879
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The standard paradigm for solving coding tasks via large language models (LLMs) is to generate-then-rank programs, where the latter step uses a verifier in the ranking process. The growing consensus is that a comprehensive verifier (e.g., a full test suite) should be prioritized over an outcome reward model (ORM) whenever possible, with little consideration given to the trade-offs involved. We aim to challenge this assumption by systematically exploring the trade-off between speed and accuracy. We find that ORMs play a crucial role in scaling verification by trading accuracy for speed, even when a comprehensive verifier is available. Their value becomes especially apparent when used in a generate-prune-then-rank approach, where a faster but less accurate verifier removes incorrect solutions prior to ranking -- leading to a system that is 11.65x faster while only being 8.33% less accurate than the full test suite. We analyze the generate-prune-then-rank approach and show that it works by filtering out incorrect but highly ranked solutions. These findings enable the design of scalable and accurate program ranking systems.
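To make the pipeline concrete, here is a minimal sketch of generate-prune-then-rank; `generate`, `fast_verifier`, and `orm_score` are hypothetical stand-ins for an LLM sampler, a cheap but less accurate verifier (e.g., a small test subset or a thresholded ORM), and an outcome reward model, none of which are specified by the abstract itself:

```python
# Minimal sketch of the generate-prune-then-rank pipeline described above.
# All three callables are hypothetical placeholders, not the paper's API.
from typing import Callable, List

def generate_prune_then_rank(
    task: str,
    generate: Callable[[str, int], List[str]],   # samples n candidate programs
    fast_verifier: Callable[[str], bool],        # cheap check; may admit false positives
    orm_score: Callable[[str], float],           # outcome reward model score
    n_samples: int = 64,
) -> List[str]:
    candidates = generate(task, n_samples)
    # Prune: drop candidates the fast verifier rejects before the costly ranking step.
    survivors = [c for c in candidates if fast_verifier(c)]
    # Rank the survivors by ORM score, best first.
    return sorted(survivors, key=orm_score, reverse=True)
```

The speed/accuracy trade-off lives entirely in `fast_verifier`: the looser the pruning check, the faster the system and the more incorrect-but-highly-ranked candidates survive to the ranking stage.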
Related papers
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- Probing for Arithmetic Errors in Language Models [86.8227317662622]
Internal activations in language models can be used to detect arithmetic errors. We show that simple probes can accurately decode both the model's predicted output and the correct answer from hidden states. We train lightweight error detectors that predict model correctness with over 90% accuracy.
arXiv Detail & Related papers (2025-07-16T16:27:50Z)
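The lightweight detectors described above amount to simple classifiers over hidden states; a minimal sketch of such a correctness probe, assuming the activation matrix and correctness labels have already been extracted from the model (the placeholder data below is illustrative only):

```python
# Sketch of a linear correctness probe over hidden states.
# `hidden_states` (n_examples x d_model) and `is_correct` labels are assumed
# to have been extracted from the language model beforehand.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 512))   # placeholder activations
is_correct = rng.integers(0, 2, size=1000)     # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, is_correct, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```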
- Towards Robust Fact-Checking: A Multi-Agent System with Advanced Evidence Retrieval [1.515687944002438]
The rapid spread of misinformation in the digital era poses significant challenges to public discourse. Traditional human-led fact-checking methods, while credible, struggle with the volume and velocity of online content. This paper proposes a novel multi-agent system for automated fact-checking that enhances accuracy, efficiency, and explainability.
arXiv Detail & Related papers (2025-06-22T02:39:27Z)
- Reinforcement Speculative Decoding for Fast Ranking [9.584558586988953]
Large Language Models (LLMs) have been widely adopted in ranking systems such as information retrieval (IR) systems and recommender systems (RSs). We propose a Reinforcement Speculative Decoding method for fast ranking inference of LLMs.
arXiv Detail & Related papers (2025-05-23T02:25:26Z)
- Search-Based Correction of Reasoning Chains for Language Models [72.61861891295302]
Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs). We introduce a new self-correction framework that augments each reasoning step in a CoT with a latent variable indicating its veracity. We also introduce Search Corrector, a discrete search algorithm over veracity assignments.
arXiv Detail & Related papers (2025-05-17T04:16:36Z)
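As a rough illustration of what a discrete search over per-step veracity assignments looks like, here is a brute-force sketch; `score_assignment` is a hypothetical stand-in for the framework's model-based scoring of an assignment, and real chains would need a greedy or sampled search rather than full enumeration:

```python
# Brute-force sketch: search over boolean veracity assignments for a CoT.
# `score_assignment` is a hypothetical stand-in for a model-based likelihood
# of the final answer given which steps are assumed correct.
from itertools import product
from typing import Callable, List, Optional, Tuple

def search_corrector(
    steps: List[str],
    score_assignment: Callable[[Tuple[bool, ...]], float],
) -> Optional[Tuple[bool, ...]]:
    best, best_score = None, float("-inf")
    # Enumerate all 2^n veracity assignments (feasible only for short chains).
    for assignment in product([True, False], repeat=len(steps)):
        s = score_assignment(assignment)
        if s > best_score:
            best, best_score = assignment, s
    return best
```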
- MFH: A Multi-faceted Heuristic Algorithm Selection Approach for Software Verification [23.80925841520252]
We propose an automated algorithm selection approach, namely MFH, for software verification. MFH embeds the code property graph (CPG) of a semantic-preserving transformed program to enhance the robustness of the prediction model. We evaluate MFH on 20 verifiers and over 15,000 verification tasks.
arXiv Detail & Related papers (2025-03-28T08:21:00Z)
- Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification [35.347715518778095]
We study the scaling trends governing sampling-based search. We find that simply scaling up a minimalist implementation of sampling-based search provides a practical inference method. We identify two useful principles for improving self-verification capabilities with test-time compute.
arXiv Detail & Related papers (2025-02-03T21:31:07Z)
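A minimalist implementation of sampling-based search, in the sense of the blurb above, amounts to best-of-n selection under a verifier; in this sketch, `sample` and `verify` are hypothetical placeholders for an LLM sampler and a (self-)verification scorer:

```python
# Minimal best-of-n sampling-based search: draw n candidates, keep the one
# the verifier scores highest. Both callables are hypothetical placeholders.
from typing import Callable

def sampling_based_search(
    prompt: str,
    sample: Callable[[str], str],          # draws one candidate response
    verify: Callable[[str, str], float],   # scores a (prompt, candidate) pair
    n: int = 16,
) -> str:
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verify(prompt, c))
```

Scaling then has two independent knobs: the number of samples `n` and the compute spent inside `verify`.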
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
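One way to read the CoT/PoT integration is cross-checking a natural-language answer against the result of executing a generated program; a hedged sketch under that assumption (both artifacts would come from an LLM in practice, and the agreement rule here is deliberately simplistic):

```python
# Sketch of collaborative verification: accept a solution only when the
# Chain-of-Thought answer agrees with the executed Program-of-Thought result.
# The convention that the PoT program sets a variable `answer` is assumed.
def collaborative_verify(cot_answer: str, pot_program: str) -> bool:
    scope: dict = {}
    try:
        exec(pot_program, scope)   # run the PoT program in an isolated namespace
    except Exception:
        return False               # a crashing program cannot corroborate the CoT
    return str(scope.get("answer")) == cot_answer.strip()
```

For example, `collaborative_verify("42", "answer = 6 * 7")` returns True, while a disagreement between the two reasoning styles rejects the candidate.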
- Multi-step Problem Solving Through a Verifier: An Empirical Analysis on Model-induced Process Supervision [40.984680166762345]
We introduce Model-induced Process Supervision (MiPS), a novel method for automating data curation.
MiPS annotates an intermediate step by sampling completions of the partial solution from the reasoning model and computing an accuracy defined as the proportion of correct completions.
Our approach significantly improves the performance of PaLM 2 on math and coding tasks.
arXiv Detail & Related papers (2024-02-05T00:57:51Z)
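The MiPS annotation rule above reduces to a Monte Carlo estimate per step; a sketch, with `complete` and `is_correct` as hypothetical stand-ins for the reasoning model's sampler and the final-answer checker:

```python
# Sketch of MiPS-style step annotation: score a prefix of reasoning steps by
# the fraction of sampled completions that reach a correct final answer.
from typing import Callable, List

def mips_step_score(
    prefix_steps: List[str],
    complete: Callable[[List[str]], str],   # samples one completion of the prefix
    is_correct: Callable[[str], bool],      # checks the completed solution
    n_samples: int = 8,
) -> float:
    completions = [complete(prefix_steps) for _ in range(n_samples)]
    return sum(is_correct(c) for c in completions) / n_samples
```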
- GLaPE: Gold Label-agnostic Prompt Evaluation and Optimization for Large Language Model [59.495717939664246]
We propose a gold label-agnostic prompt evaluation (GLaPE) method to alleviate dependence on gold labels. We show that GLaPE provides evaluations that track accuracy reliably, even in the absence of gold labels. On six popular reasoning tasks, our GLaPE-based prompt optimization yields effective prompts comparable to accuracy-based ones.
arXiv Detail & Related papers (2024-02-04T08:57:54Z)
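A gold-label-agnostic evaluation in this spirit can score a prompt by the self-consistency of sampled answers; a hedged sketch (the sampler `ask` is a hypothetical placeholder, and plain majority-vote agreement is used here rather than the paper's exact formulation):

```python
# Sketch of gold-label-agnostic prompt scoring via answer self-consistency:
# a prompt whose sampled answers agree with each other scores higher, with
# no gold labels involved. `ask` is a hypothetical LLM sampling function.
from collections import Counter
from typing import Callable, List

def self_consistency_score(
    prompt: str,
    questions: List[str],
    ask: Callable[[str, str], str],   # samples one answer for (prompt, question)
    n_samples: int = 8,
) -> float:
    per_question = []
    for q in questions:
        answers = [ask(prompt, q) for _ in range(n_samples)]
        top_count = Counter(answers).most_common(1)[0][1]
        per_question.append(top_count / n_samples)  # agreement with the modal answer
    return sum(per_question) / len(per_question)
```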
- Factual Error Correction for Abstractive Summaries Using Entity Retrieval [57.01193722520597]
We propose an efficient factual error correction system, RFEC, based on an entity retrieval post-editing process.
RFEC retrieves the evidence sentences from the original document by comparing the sentences with the target summary.
Next, RFEC detects entity-level errors in the summaries by considering the evidence sentences and substitutes the wrong entities with the accurate entities from the evidence sentences.
arXiv Detail & Related papers (2022-04-18T11:35:02Z)
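Mechanically, that post-editing step can be pictured as swapping each unsupported entity in the summary for a supported one of the same type; a toy sketch under that reading, with `extract_entities` as a hypothetical NER stand-in supplied by the caller:

```python
# Toy sketch of entity-level post-editing: replace entities in the summary
# that never appear in the retrieved evidence with same-type entities that do.
from typing import Callable, List, Tuple

Entity = Tuple[str, str]  # (surface text, entity type)

def correct_entities(
    summary: str,
    evidence_sentences: List[str],
    extract_entities: Callable[[str], List[Entity]],  # hypothetical NER function
) -> str:
    evidence_entities = [e for s in evidence_sentences for e in extract_entities(s)]
    for text, etype in extract_entities(summary):
        if any(text == ev_text for ev_text, _ in evidence_entities):
            continue  # entity is supported by the evidence; keep it
        # Substitute the first evidence entity of the same type, if one exists.
        for ev_text, ev_type in evidence_entities:
            if ev_type == etype:
                summary = summary.replace(text, ev_text)
                break
    return summary
```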
- Optimal Change-Point Detection with Training Sequences in the Large and Moderate Deviations Regimes [72.68201611113673]
This paper investigates a novel offline change-point detection problem from an information-theoretic perspective.
We assume that the underlying pre- and post-change distributions are unknown and can only be learned from the available training sequences.
arXiv Detail & Related papers (2020-03-13T23:39:40Z)