Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling
- URL: http://arxiv.org/abs/2601.22636v1
- Date: Fri, 30 Jan 2026 06:54:35 GMT
- Title: Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling
- Authors: Mingqian Feng, Xiaodong Liu, Weiwei Yang, Chenliang Xu, Christopher White, Jianfeng Gao
- Abstract summary: Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting. We propose a scaling-aware Best-of-N estimation of risk, SABER, for modeling jailbreak vulnerability under Best-of-N sampling.
- Score: 50.872910438715486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent work shows that attack success increases with repeated sampling, principled methods for predicting large-scale adversarial risk remain limited. We propose a scaling-aware Best-of-N estimation of risk, SABER, for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities using a Beta distribution, the conjugate prior of the Bernoulli distribution, and derive an analytic scaling law that enables reliable extrapolation of large-N attack success rates from small-budget measurements. Using only n=100 samples, our anchored estimator predicts ASR@1000 with a mean absolute error of 1.66, compared to 12.04 for the baseline, which is an 86.2% reduction in estimation error. Our results reveal heterogeneous risk scaling profiles and show that models appearing robust under standard evaluation can experience rapid nonlinear risk amplification under parallel adversarial pressure. This work provides a low-cost, scalable methodology for realistic LLM safety assessment. We will release our code and evaluation scripts upon publication to support future research.
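A minimal sketch of the Beta-Bernoulli scaling idea described in the abstract: if a prompt's per-sample success probability is p ~ Beta(alpha, beta), then the probability that at least one of N samples succeeds has the closed form ASR@N = 1 - B(alpha, beta + N) / B(alpha, beta), which can be extrapolated from small-budget measurements. This is not the paper's SABER implementation; the method-of-moments fit, function names, and synthetic data below are illustrative assumptions.

```python
# Sketch (assumption): extrapolating ASR@N under a Beta-Bernoulli model of
# per-prompt jailbreak probability, fit from small-budget measurements.
import numpy as np
from scipy.special import betaln


def asr_at_n(alpha: float, beta: float, n: int) -> float:
    """Closed-form ASR at budget n when p ~ Beta(alpha, beta)."""
    # E[(1 - p)^n] = B(alpha, beta + n) / B(alpha, beta), computed in log space.
    return 1.0 - np.exp(betaln(alpha, beta + n) - betaln(alpha, beta))


def fit_beta_moments(success_rates: np.ndarray) -> tuple[float, float]:
    """Method-of-moments Beta fit to observed per-prompt success rates
    (an illustrative stand-in for the paper's anchored estimator)."""
    m, v = success_rates.mean(), success_rates.var()
    v = max(v, 1e-8)                         # guard against degenerate variance
    common = m * (1.0 - m) / v - 1.0
    return max(m * common, 1e-3), max((1.0 - m) * common, 1e-3)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_p = rng.beta(0.05, 2.0, size=200)   # synthetic per-prompt risk levels
    small_budget = 100                        # n = 100 probes per prompt
    observed = rng.binomial(small_budget, true_p) / small_budget

    a, b = fit_beta_moments(observed)
    print(f"extrapolated ASR@1000 (fraction) ~= {asr_at_n(a, b, 1000):.3f}")
    print(f"ground-truth  ASR@1000 (fraction) ~= {np.mean(1 - (1 - true_p) ** 1000):.3f}")
```

The closed form follows from the Beta integral, so only alpha and beta need to be estimated from the small-n measurements; the large-N behavior then comes for free from the analytic expression rather than from additional sampling.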
Related papers
- Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression [10.64873345204336]
A lightweight model designed to predict the attack success rate (ASR) of adversarial prompts remains underexplored. We propose a novel framework that incorporates an improved outline filling attack to achieve dense sampling of the model's security boundaries. Experimental results show that our proxy model achieves an accuracy of 91.1 percent in predicting the relative ranking of average long response.
arXiv Detail & Related papers (2025-11-27T02:55:31Z) - The Tail Tells All: Estimating Model-Level Membership Inference Vulnerability Without Reference Models [8.453525669833853]
We present a novel approach for estimating model-level vulnerability, the TPR at low FPR, to membership inference attacks without requiring reference models. We show our method to outperform both low-cost (few reference models) attacks such as RMIA and other measures of distribution difference.
arXiv Detail & Related papers (2025-10-22T17:03:55Z) - Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs [19.045128057653784]
We introduce time-to-unsafe-sampling, a novel safety measure for generative models. Unsafe outputs are often rare in well-aligned models and thus may not be observed under any feasible sampling budget. We propose a novel calibration technique to construct a lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt with rigorous coverage guarantees.
arXiv Detail & Related papers (2025-06-16T15:21:25Z) - Prediction-Powered Causal Inferences [59.98498488132307]
We focus on Prediction-Powered Causal Inferences (PPCI). We first show that conditional calibration guarantees valid PPCI at population level. We then introduce a sufficient representation constraint transferring validity across experiments.
arXiv Detail & Related papers (2025-02-10T10:52:17Z) - Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities [49.09703018511403]
Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. We propose evaluating LLMs with model tampering attacks which allow for modifications to latent activations or weights.
arXiv Detail & Related papers (2025-02-03T18:59:16Z) - Risk-Averse Certification of Bayesian Neural Networks [70.44969603471903]
We propose a Risk-Averse Certification framework for Bayesian neural networks called RAC-BNN. Our method leverages sampling and optimisation to compute a sound approximation of the output set of a BNN. We validate RAC-BNN on a range of regression and classification benchmarks and compare its performance with a state-of-the-art method.
arXiv Detail & Related papers (2024-11-29T14:22:51Z) - Confidence Aware Learning for Reliable Face Anti-spoofing [52.23271636362843]
We propose a Confidence Aware Face Anti-spoofing model, which is aware of its capability boundary. We estimate its confidence during the prediction of each sample. Experiments show that the proposed CA-FAS can effectively recognize samples with low prediction confidence.
arXiv Detail & Related papers (2024-11-02T14:29:02Z) - Mitigating optimistic bias in entropic risk estimation and optimization with an application to insurance [5.407319151576265]
The entropic risk measure is widely used to account for tail risks associated with an uncertain loss. To mitigate the bias in the empirical entropic risk estimator, we propose a strongly consistent bootstrapping procedure. We show that our methods suggest a higher (and more accurate) premium to homeowners.
arXiv Detail & Related papers (2024-09-30T04:02:52Z) - Safe Deployment for Counterfactual Learning to Rank with Exposure-Based Risk Minimization [63.93275508300137]
We introduce a novel risk-aware Counterfactual Learning To Rank method with theoretical guarantees for safe deployment.
Our experimental results demonstrate the efficacy of our proposed method, which is effective at avoiding initial periods of bad performance when little data is available.
arXiv Detail & Related papers (2023-04-26T15:54:23Z) - Selecting Models based on the Risk of Damage Caused by Adversarial Attacks [2.969705152497174]
Regulation, legal liabilities, and societal concerns challenge the adoption of AI in safety and security-critical applications.
One of the key concerns is that adversaries can cause harm by manipulating model predictions without being detected.
We propose a method to model and statistically estimate the probability of damage arising from adversarial attacks.
arXiv Detail & Related papers (2023-01-28T10:24:38Z) - Improving Maximum Likelihood Training for Text Generation with Density Ratio Estimation [51.091890311312085]
We propose a new training scheme for auto-regressive sequence generative models, which is effective and stable when operating at large sample space encountered in text generation.
Our method stably outperforms Maximum Likelihood Estimation and other state-of-the-art sequence generative models in terms of both quality and diversity.
arXiv Detail & Related papers (2020-07-12T15:31:24Z)