ROC-n-reroll: How verifier imperfection affects test-time scaling
- URL: http://arxiv.org/abs/2507.12399v1
- Date: Wed, 16 Jul 2025 16:44:29 GMT
- Title: ROC-n-reroll: How verifier imperfection affects test-time scaling
- Authors: Florian E. Dorner, Yatong Chen, André F. Cruz, Fanny Yang
- Abstract summary: Test-time scaling aims to improve language model performance by leveraging additional compute during inference. There is little theoretical understanding of how verifier imperfection affects performance. We prove how instance-level accuracy of methods is precisely characterized by the geometry of the verifier's ROC curve.
- Score: 10.949594516629652
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Test-time scaling aims to improve language model performance by leveraging additional compute during inference. While many works have empirically studied techniques like Best-of-N (BoN) and rejection sampling that make use of a verifier to enable test-time scaling, there is little theoretical understanding of how verifier imperfection affects performance. In this work, we address this gap. Specifically, we prove how instance-level accuracy of these methods is precisely characterized by the geometry of the verifier's ROC curve. Interestingly, while scaling is determined by the local geometry of the ROC curve for rejection sampling, it depends on global properties of the ROC curve for BoN. As a consequence, when the ROC curve is unknown, it is impossible to extrapolate the performance of rejection sampling based on the low-compute regime. Furthermore, while rejection sampling outperforms BoN for fixed compute, in the infinite-compute limit both methods converge to the same level of accuracy, determined by the slope of the ROC curve near the origin. Our theoretical results are confirmed by experiments on GSM8K using different versions of Llama and Qwen to generate and verify solutions.
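To make the comparison concrete, the toy simulation below contrasts Best-of-N and rejection sampling when the verifier is imperfect. It is only an illustrative sketch, not the paper's experimental code: the generator is modelled as a coin flip with success probability P_CORRECT, the verifier's scores are overlapping Gaussians, and TAU and all other numbers are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

P_CORRECT = 0.3    # generator's per-sample accuracy (assumed)
TAU = 0.8          # verifier-score threshold for rejection sampling (assumed)
N_TRIALS = 20_000  # Monte Carlo repetitions per setting

def verifier_score(is_correct):
    """Imperfect verifier: correct answers score higher on average, but the
    score distributions overlap, which induces a non-trivial ROC curve."""
    return rng.normal(loc=np.where(is_correct, 1.0, 0.0), scale=1.0)

def best_of_n(n):
    """Accuracy when the highest-scoring of n samples is returned."""
    correct = rng.random((N_TRIALS, n)) < P_CORRECT
    scores = verifier_score(correct)
    picked = scores.argmax(axis=1)
    return correct[np.arange(N_TRIALS), picked].mean()

def rejection_sampling(max_draws):
    """Accuracy when the first sample whose score clears TAU is returned
    (falling back to the last draw if none is accepted)."""
    hits = 0
    for _ in range(N_TRIALS):
        answer_correct = False
        for _ in range(max_draws):
            answer_correct = rng.random() < P_CORRECT
            if verifier_score(answer_correct) > TAU:
                break
        hits += answer_correct
    return hits / N_TRIALS

for n in (1, 4, 16, 64):
    print(f"N={n:3d}  best-of-N acc={best_of_n(n):.3f}  rejection acc={rejection_sampling(n):.3f}")
```

Ignoring the fallback case where no draw clears the threshold, the rejection-sampling accuracy in this toy model is just Bayes' rule at the chosen operating point, pi*TPR(tau) / (pi*TPR(tau) + (1-pi)*FPR(tau)) with pi the generator's accuracy, which is where the verifier's ROC curve enters; BoN instead depends on the full score distributions.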
Related papers
- On the Wasserstein Convergence and Straightness of Rectified Flow [54.580605276017096]
Rectified Flow (RF) is a generative model that aims to learn straight flow trajectories from noise to data. We provide a theoretical analysis of the Wasserstein distance between the sampling distribution of RF and the target distribution. We present general conditions guaranteeing uniqueness and straightness of 1-RF, which is in line with previous empirical findings.
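For context (notation mine, not taken from this summary), the standard rectified-flow training objective regresses a velocity field onto straight-line displacements between paired noise and data samples:

```latex
% Standard rectified-flow objective; x_0 ~ noise, x_1 ~ data, t ~ U[0,1].
\min_{v}\; \mathbb{E}_{x_0,\,x_1,\,t}\,
  \big\| (x_1 - x_0) - v(x_t,\, t) \big\|^2,
\qquad x_t = (1-t)\,x_0 + t\,x_1 .
```

Here 1-RF refers to the flow obtained from the first round of this training, before any reflow step.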
arXiv Detail & Related papers (2024-10-19T02:36:11Z)
- The curious case of the test set AUROC [0.5242869847419834]
We argue that considering scores derived from the test ROC curve alone gives only a narrow insight into how a model performs and its ability to generalise.
arXiv Detail & Related papers (2023-12-19T17:40:58Z)
- The Lipschitz-Variance-Margin Tradeoff for Enhanced Randomized Smoothing [85.85160896547698]
Real-life applications of deep neural networks are hindered by their unsteady predictions when faced with noisy inputs and adversarial attacks.
We show how to design an efficient classifier with a certified radius by relying on noise injection into the inputs.
Our novel certification procedure allows us to use pre-trained models with randomized smoothing, effectively improving the current certification radius in a zero-shot manner.
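For background on the smoothing mechanism itself (not this paper's Lipschitz-variance-margin certification), a minimal Cohen-style randomized-smoothing sketch looks like the following; the base classifier, sigma, and sample count are placeholders.

```python
import numpy as np
from scipy.stats import norm

def smoothed_predict_and_radius(base_classifier, x, sigma=0.25, n_samples=1000, rng=None):
    """Majority vote over Gaussian perturbations of x plus a certified L2 radius.

    base_classifier: callable mapping a batch of inputs to integer class labels.
    The radius uses the simple bound sigma * Phi^{-1}(p_top), with p_top the
    empirical top-class frequency (no confidence correction, unlike a real
    certification procedure)."""
    rng = rng or np.random.default_rng(0)
    noisy = x[None, :] + sigma * rng.standard_normal((n_samples, x.shape[0]))
    counts = np.bincount(base_classifier(noisy))
    top = int(counts.argmax())
    p_top = min(counts[top] / n_samples, 1 - 1e-6)
    if p_top <= 0.5:                      # abstain: no certified prediction
        return top, 0.0
    return top, sigma * norm.ppf(p_top)   # certified L2 radius

# Toy usage: a linear "classifier" on 2-D inputs.
clf = lambda batch: (batch @ np.array([1.0, -1.0]) > 0).astype(int)
print(smoothed_predict_and_radius(clf, np.array([0.6, 0.1])))
```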
arXiv Detail & Related papers (2023-09-28T22:41:47Z)
- Neural Operator Variational Inference based on Regularized Stein Discrepancy for Deep Gaussian Processes [23.87733307119697]
We introduce Neural Operator Variational Inference (NOVI) for Deep Gaussian Processes.
NOVI uses a neural generator to obtain a sampler and minimizes the Regularized Stein Discrepancy in L2 space between the generated distribution and true posterior.
We demonstrate that the bias introduced by our method can be controlled by multiplying the divergence with a constant, which leads to robust error control and ensures the stability and precision of the algorithm.
arXiv Detail & Related papers (2023-09-22T06:56:35Z)
- Collapsed Inference for Bayesian Deep Learning [36.1725075097107]
We introduce a novel collapsed inference scheme that performs Bayesian model averaging using collapsed samples.
A collapsed sample represents uncountably many models drawn from the approximate posterior.
Our proposed use of collapsed samples achieves a balance between scalability and accuracy.
arXiv Detail & Related papers (2023-06-16T08:34:42Z)
- Learning for Transductive Threshold Calibration in Open-World Recognition [83.35320675679122]
We introduce OpenGCN, a Graph Neural Network-based transductive threshold calibration method with enhanced robustness and adaptability.
Experiments across open-world visual recognition benchmarks validate OpenGCN's superiority over existing posthoc calibration methods for open-world threshold calibration.
arXiv Detail & Related papers (2023-05-19T23:52:48Z)
- Neural Inference of Gaussian Processes for Time Series Data of Quasars [72.79083473275742]
We introduce a new model that enables a complete description of quasar spectra.
We also introduce a new method of inference of Gaussian process parameters, which we call $\textit{Neural Inference}$.
The combination of both the CDRW model and Neural Inference significantly outperforms the baseline DRW and MLE.
arXiv Detail & Related papers (2022-11-17T13:01:26Z)
- Error-based Knockoffs Inference for Controlled Feature Selection [49.99321384855201]
We propose an error-based knockoff inference method by integrating the knockoff features, the error-based feature importance statistics, and the stepdown procedure together.
The proposed inference procedure does not require specifying a regression model and can handle feature selection with theoretical guarantees.
arXiv Detail & Related papers (2022-03-09T01:55:59Z)
- The $f$-divergence and Loss Functions in ROC Curve [2.9823962001574182]
Given two data distributions and a test score function, the Receiver Operating Characteristic (ROC) curve shows how well such a score separates two distributions.
Can the ROC curve be used as a measure of discrepancy between two distributions?
This paper shows that when the data likelihood ratio is used as the test score, the arc length of the ROC curve gives rise to a novel $f$-divergence measuring the differences between two data distributions.
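A quick numerical illustration of this statement (with made-up equal-variance Gaussians, not the paper's setup): score both distributions by a quantity monotone in the likelihood ratio, trace the empirical ROC curve, and measure its arc length. Identical distributions give the diagonal of length sqrt(2), and the length grows toward 2 as the distributions separate.

```python
import numpy as np

def roc_arc_length(mu1, mu0, sigma=1.0, n=200_000, seed=0):
    """Arc length of the empirical ROC curve for two 1-D Gaussians.
    For equal variances the likelihood ratio is monotone in x, so thresholding
    x itself traces the same ROC curve as thresholding the likelihood ratio."""
    rng = np.random.default_rng(seed)
    pos = rng.normal(mu1, sigma, n)   # samples from P (positives)
    neg = rng.normal(mu0, sigma, n)   # samples from Q (negatives)
    thresholds = np.linspace(min(mu0, mu1) - 5 * sigma, max(mu0, mu1) + 5 * sigma, 2000)
    tpr = np.array([(pos > t).mean() for t in thresholds])
    fpr = np.array([(neg > t).mean() for t in thresholds])
    return float(np.sum(np.hypot(np.diff(fpr), np.diff(tpr))))

print(roc_arc_length(0.0, 0.0))  # identical distributions -> ~sqrt(2) ~ 1.414
print(roc_arc_length(2.0, 0.0))  # separated distributions -> larger arc length
```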
arXiv Detail & Related papers (2021-10-18T23:12:35Z)
- Calibration of Neural Networks using Splines [51.42640515410253]
Measuring calibration error amounts to comparing two empirical distributions.
We introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test.
Our method consistently outperforms existing methods on KS error as well as other commonly used calibration measures.
arXiv Detail & Related papers (2020-06-23T07:18:05Z)
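A binning-free KS-style calibration error of the kind described above can be sketched as follows: sort predictions by confidence and take the largest gap between the running average of predicted probability and of observed correctness. The data below are synthetic, and the snippet is a generic illustration rather than the paper's spline-based recalibration code.

```python
import numpy as np

def ks_calibration_error(confidences: np.ndarray, correct: np.ndarray) -> float:
    """Binning-free KS-style calibration error: max gap between the cumulative
    mean of predicted confidence and of observed correctness, sorted by confidence."""
    order = np.argsort(confidences)
    conf_cum = np.cumsum(confidences[order]) / len(confidences)
    acc_cum = np.cumsum(correct[order].astype(float)) / len(correct)
    return float(np.max(np.abs(conf_cum - acc_cum)))

# Synthetic example: a model that is roughly 10 points overconfident.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=10_000)
correct = rng.random(10_000) < (conf - 0.1)
print(f"KS calibration error ~= {ks_calibration_error(conf, correct):.3f}")  # ~0.10
```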
This list is automatically generated from the titles and abstracts of the papers on this site.