Efficient Prediction of Pass@k Scaling in Large Language Models
- URL: http://arxiv.org/abs/2510.05197v1
- Date: Mon, 06 Oct 2025 17:42:27 GMT
- Title: Efficient Prediction of Pass@k Scaling in Large Language Models
- Authors: Joshua Kazdan, Rylan Schaeffer, Youssef Allouah, Colin Sullivan, Kyssen Yu, Noam Levi, Sanmi Koyejo
- Abstract summary: We show how to accurately predict a model's behavior when scaled to a massive number of attempts. Standard methods for fitting these laws suffer from statistical shortcomings that hinder predictive accuracy. We propose a dynamic sampling strategy that allocates a greater budget to harder problems.
- Score: 34.59142976295509
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Assessing the capabilities and risks of frontier AI systems is a critical area of research, and recent work has shown that repeated sampling from models can dramatically increase both. For instance, repeated sampling has been shown to increase models' capabilities, such as solving difficult math and coding problems, but it has also been shown to increase their potential for harm, such as susceptibility to jailbreaking. Such results raise a crucial question for both capability and safety forecasting: how can one accurately predict a model's behavior when scaled to a massive number of attempts, given a vastly smaller sampling budget? This question is directly relevant to model providers, who serve hundreds of millions of users daily, and to governmental regulators, who seek to prevent harms. To answer this question, we make three contributions. First, we find that standard methods for fitting these scaling laws suffer from statistical shortcomings that hinder predictive accuracy, especially in data-limited scenarios. Second, we remedy these shortcomings by introducing a robust estimation framework that uses a beta-binomial distribution to generate more accurate predictions from limited data. Third, we propose a dynamic sampling strategy that allocates a greater budget to harder problems. Combined, these innovations enable more reliable prediction of rare risks and capabilities at a fraction of the computational cost.
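The beta-binomial framework named in the second contribution admits a compact illustration. The sketch below is a minimal reconstruction from the abstract alone, not the authors' released code: it assumes each problem's per-attempt success probability p follows Beta(a, b), fits (a, b) by maximum likelihood from a small per-problem budget, and extrapolates pass@k via the closed form E[1 - (1 - p)^k] = 1 - B(a, b + k) / B(a, b). The synthetic data and all function names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln, gammaln

def beta_binom_nll(log_params, successes, attempts):
    """Negative log-likelihood of per-problem success counts under a
    beta-binomial model: p_i ~ Beta(a, b), c_i ~ Binomial(n_i, p_i)."""
    a, b = np.exp(log_params)
    c, n = successes, attempts
    log_pmf = (gammaln(n + 1) - gammaln(c + 1) - gammaln(n - c + 1)
               + betaln(c + a, n - c + b) - betaln(a, b))
    return -log_pmf.sum()

def fit_beta_binomial(successes, attempts):
    """Fit (a, b) jointly over all problems by maximum likelihood."""
    res = minimize(beta_binom_nll, x0=[0.0, 0.0],
                   args=(successes, attempts), method="Nelder-Mead")
    return np.exp(res.x)

def predict_pass_at_k(a, b, k):
    """Population pass@k: 1 - E[(1 - p)^k] = 1 - B(a, b + k) / B(a, b)."""
    return 1.0 - np.exp(betaln(a, b + k) - betaln(a, b))

# Synthetic demo (assumed setup, not the paper's data): 200 problems, 16 attempts each.
rng = np.random.default_rng(0)
true_p = rng.beta(0.3, 3.0, size=200)          # mostly hard problems
attempts = np.full(200, 16)
successes = rng.binomial(attempts, true_p)

a, b = fit_beta_binomial(successes, attempts)
for k in (1, 16, 256, 10_000):
    print(f"predicted pass@{k}: {predict_pass_at_k(a, b, k):.3f}")
```

Because the Beta prior pools information across problems, the fit stays stable even when many problems have zero observed successes, which is precisely the data-limited regime where per-problem maximum-likelihood estimates of p break down.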
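The third contribution, dynamic sampling, can be sketched in the same spirit. The heuristic below only illustrates the general idea stated in the abstract (steer the remaining budget toward problems that still look hard); it is not the paper's actual allocation policy, and `sample_fn` is a hypothetical callback that runs one attempt on a problem.

```python
import numpy as np

def dynamic_allocate(sample_fn, num_problems, total_budget,
                     batch=64, warmup=4):
    """Spend a fixed attempt budget, steering later batches toward problems
    with the lowest estimated success rate (illustrative heuristic only).

    sample_fn(i) -> bool runs one attempt on problem i and reports success.
    """
    attempts = np.zeros(num_problems, dtype=int)
    successes = np.zeros(num_problems, dtype=int)

    # Uniform warm-up so every problem gets a crude difficulty estimate
    # (assumes total_budget >= num_problems * warmup).
    for i in range(num_problems):
        for _ in range(warmup):
            successes[i] += sample_fn(i)
            attempts[i] += 1

    remaining = total_budget - int(attempts.sum())
    while remaining > 0:
        # Posterior-mean success rate under a Beta(1, 1) prior; low = hard.
        rate = (successes + 1) / (attempts + 2)
        # Hand the next batch of attempts to the currently hardest problems.
        hardest = np.argsort(rate)[: min(batch, remaining)]
        for i in hardest:
            successes[i] += sample_fn(i)
            attempts[i] += 1
        remaining -= len(hardest)
    return successes, attempts
```

With a mock callback such as `lambda i: rng.random() < true_p[i]`, the returned counts plug directly into `fit_beta_binomial` above; concentrating attempts on rarely solved problems pins down the low-p tail of the Beta fit, which is what dominates pass@k at large k.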
Related papers
- Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling [50.872910438715486]
Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting. We propose SABER, a scaling-aware Best-of-N estimation of risk, for modeling jailbreak vulnerability under Best-of-N sampling (a generic sketch of Best-of-N risk scaling appears after this list).
arXiv Detail & Related papers (2026-01-30T06:54:35Z)
- Forecasting Rare Language Model Behaviors [20.712406244928832]
We introduce a method to forecast potential risks across orders of magnitude more queries than we test during evaluation. We find that our forecasts can predict the emergence of diverse undesirable behaviors across up to three orders of magnitude of query volume. Our work enables model developers to proactively anticipate and patch rare failures before they manifest during large-scale deployments.
arXiv Detail & Related papers (2025-02-24T03:16:15Z)
- Overcoming Overconfidence for Active Learning [1.2776312584227847]
We present two novel methods to address the problem of overconfidence that arises in the active learning scenario.
The first is an augmentation strategy named Cross-Mix-and-Mix (CMaM), which aims to calibrate the model by expanding the limited training distribution.
The second is a selection strategy named Ranked Margin Sampling (RankedMS), which prevents choosing data that leads to overly confident predictions.
arXiv Detail & Related papers (2023-08-21T09:04:54Z)
- Learning Sample Difficulty from Pre-trained Models for Reliable Prediction [55.77136037458667]
We propose to utilize large-scale pre-trained models to guide downstream model training with sample difficulty-aware entropy regularization.
We simultaneously improve accuracy and uncertainty calibration across challenging benchmarks.
arXiv Detail & Related papers (2023-04-20T07:29:23Z)
- ZigZag: Universal Sampling-free Uncertainty Estimation Through Two-Step Inference [54.17205151960878]
We introduce a sampling-free approach that is generic and easy to deploy.
We produce reliable uncertainty estimates on par with state-of-the-art methods at a significantly lower computational cost.
arXiv Detail & Related papers (2022-11-21T13:23:09Z)
- Two-stage Modeling for Prediction with Confidence [0.0]
Neural networks struggle to generalize their performance under distribution shift.
We propose a novel two-stage model for the potential distribution shift problem.
We show that our model offers reliable predictions for the vast majority of datasets.
arXiv Detail & Related papers (2022-09-19T08:48:07Z)
- Predictive Multiplicity in Probabilistic Classification [25.111463701666864]
We present a framework for measuring predictive multiplicity in probabilistic classification.
We demonstrate the incidence and prevalence of predictive multiplicity in real-world tasks.
Our results emphasize the need to report predictive multiplicity more widely.
arXiv Detail & Related papers (2022-06-02T16:25:29Z)
- Improving Uncertainty Calibration via Prior Augmented Data [56.88185136509654]
Neural networks have proven successful at learning from complex data distributions by acting as universal function approximators.
However, they are often overconfident, which leads to inaccurate and miscalibrated probabilistic predictions.
We propose a solution by seeking out regions of feature space where the model is unjustifiably overconfident, and conditionally raising the entropy of those predictions towards that of the prior distribution of the labels.
arXiv Detail & Related papers (2021-02-22T07:02:37Z)
- Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning [123.3472310767721]
Prediction credibility measures are fundamental in statistics and machine learning.
These measures should account for the wide variety of models used in practice.
The framework developed in this work expresses the credibility as a risk-fit trade-off.
arXiv Detail & Related papers (2020-11-24T19:52:38Z)
- Forecasting COVID-19 daily cases using phone call data [0.0]
We propose a simple Multiple Linear Regression model optimised to use call data to forecast the number of daily confirmed cases.
Our proposed approach outperforms ARIMA, ETS, and a regression model without call data, as evaluated by three point-forecast error metrics, one prediction-interval measure, and two probabilistic forecast accuracy measures.
arXiv Detail & Related papers (2020-10-05T18:07:07Z)
- Ambiguity in Sequential Data: Predicting Uncertain Futures with Recurrent Models [110.82452096672182]
We propose an extension of the Multiple Hypothesis Prediction (MHP) model to handle ambiguous predictions with sequential data.
We also introduce a novel metric for ambiguous problems, which is better suited to account for uncertainties.
arXiv Detail & Related papers (2020-03-10T09:15:42Z)
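As referenced in the Best-of-N entry above, the quantity these risk forecasts scale is simple: if a single adversarial attempt succeeds with probability p, at least one of N attempts succeeds with probability 1 - (1 - p)^N. The sketch below is a generic illustration under an assumed Beta(a, b) population of per-prompt success probabilities, not SABER's actual estimator; the parameter values are arbitrary.

```python
import numpy as np
from scipy.special import betaln

def best_of_n_risk(a, b, n):
    """P(at least one success in n attempts) with p ~ Beta(a, b):
    1 - E[(1 - p)^n] = 1 - B(a, b + n) / B(a, b)."""
    return 1.0 - np.exp(betaln(a, b + n) - betaln(a, b))

# Even a ~0.1% mean per-attempt risk compounds quickly under repeated sampling.
for n in (1, 10, 100, 10_000):
    print(f"N = {n:>6}: risk = {best_of_n_risk(0.05, 50.0, n):.4f}")
```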
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.