Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs
- URL: http://arxiv.org/abs/2506.13593v2
- Date: Fri, 20 Jun 2025 12:12:17 GMT
- Title: Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs
- Authors: Hen Davidov, Gilad Freidkin, Shai Feldman, Yaniv Romano,
- Abstract summary: We develop a framework to quantify the time-to-unsafe-sampling - the number of large language model (LLM) generations required to trigger an unsafe (e.g., toxic) response.<n>Our key innovation is designing an adaptive, per-prompt sampling strategy, formulated as a convex optimization problem.
- Score: 14.568210512707603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We develop a framework to quantify the time-to-unsafe-sampling - the number of large language model (LLM) generations required to trigger an unsafe (e.g., toxic) response. Estimating this quantity is challenging, since unsafe responses are exceedingly rare in well-aligned LLMs, potentially occurring only once in thousands of generations. As a result, directly estimating time-to-unsafe-sampling would require collecting training data with a prohibitively large number of generations per prompt. However, with realistic sampling budgets, we often cannot generate enough responses to observe an unsafe outcome for every prompt, leaving the time-to-unsafe-sampling unobserved in many cases, making the estimation and evaluation tasks particularly challenging. To address this, we frame this estimation problem as one of survival analysis and develop a provably calibrated lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt, leveraging recent advances in conformal prediction. Our key innovation is designing an adaptive, per-prompt sampling strategy, formulated as a convex optimization problem. The objective function guiding this optimized sampling allocation is designed to reduce the variance of the estimators used to construct the LPB, leading to improved statistical efficiency over naive methods that use a fixed sampling budget per prompt. Experiments on both synthetic and real data support our theoretical results and demonstrate the practical utility of our method for safety risk assessment in generative AI models.
Related papers
- Provably Safe Generative Sampling with Constricting Barrier Functions [1.8377602530643375]
Flow-based generative models have achieved remarkable success in learning complex data distributions.<n>We propose a safety filtering framework that acts as an online shield for any pre-trained generative model.<n>We prove that this mechanism guarantees safe sampling while minimizing the distributional shift from the original model at each sampling step.
arXiv Detail & Related papers (2026-02-24T23:06:58Z) - Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling [50.872910438715486]
Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting.<n>We propose a scaling-aware Best-of-N estimation of risk, SABER, for modeling jailbreak vulnerability under Best-of-N sampling.
arXiv Detail & Related papers (2026-01-30T06:54:35Z) - Probabilistic Safety Guarantee for Stochastic Control Systems Using Average Reward MDPs [8.872171447378685]
We present a new algorithm that computes safe policies to determine the safety level across a finite state set.<n>This algorithm reduces the safety objective to the standard average reward Markov Decision Process (MDP) objective.<n>Results indicate that the average-reward MDPs solution is more comprehensive, converges faster, and offers higher quality compared to the minimum discounted-reward solution.
arXiv Detail & Related papers (2025-11-11T16:29:40Z) - Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts.<n>We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance.<n>Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
arXiv Detail & Related papers (2025-08-17T23:46:36Z) - Efficient Uncertainty Estimation via Distillation of Bayesian Large Language Models [12.69571386421462]
In this paper, we investigate the possibility of eliminating the need for test-time sampling for uncertainty estimation.<n>We distill an off-the-shelf Bayesian LLM into a non-Bayesian student LLM by minimizing the divergence between their predictive distributions.<n>Our experiments demonstrate that uncertainty estimation capabilities on training data can successfully generalize to unseen test data.
arXiv Detail & Related papers (2025-05-16T22:26:03Z) - Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift.<n>Current approaches typically address this issue through online sampling from the target policy.<n>We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z) - Uncertainty-Aware Decoding with Minimum Bayes Risk [70.6645260214115]
We show how Minimum Bayes Risk decoding, which selects model generations according to an expected risk, can be generalized into a principled uncertainty-aware decoding method.<n>We show that this modified expected risk is useful for both choosing outputs and deciding when to abstain from generation and can provide improvements without incurring overhead.
arXiv Detail & Related papers (2025-03-07T10:55:12Z) - Uncertainty Quantification for LLM-Based Survey Simulations [9.303339416902995]
We investigate the use of large language models (LLMs) to simulate human responses to survey questions.<n>Our approach converts imperfect LLM-simulated responses into confidence sets for population parameters.<n>A key innovation lies in determining the optimal number of simulated responses.
arXiv Detail & Related papers (2025-02-25T02:07:29Z) - Conformal Generative Modeling with Improved Sample Efficiency through Sequential Greedy Filtering [55.15192437680943]
Generative models lack rigorous statistical guarantees for their outputs.<n>We propose a sequential conformal prediction method producing prediction sets that satisfy a rigorous statistical guarantee.<n>This guarantee states that with high probability, the prediction sets contain at least one admissible (or valid) example.
arXiv Detail & Related papers (2024-10-02T15:26:52Z) - Treatment of Statistical Estimation Problems in Randomized Smoothing for Adversarial Robustness [0.0]
We review the statistical estimation problems for randomized smoothing to find out if the computational burden is necessary.<n>We present estimation procedures employing confidence sequences enjoying the same statistical guarantees as the standard methods.<n>We provide a randomized version of Clopper-Pearson confidence intervals resulting in strictly stronger certificates.
arXiv Detail & Related papers (2024-06-25T14:00:55Z) - Relaxed Quantile Regression: Prediction Intervals for Asymmetric Noise [51.87307904567702]
Quantile regression is a leading approach for obtaining such intervals via the empirical estimation of quantiles in the distribution of outputs.<n>We propose Relaxed Quantile Regression (RQR), a direct alternative to quantile regression based interval construction that removes this arbitrary constraint.<n>We demonstrate that this added flexibility results in intervals with an improvement in desirable qualities.
arXiv Detail & Related papers (2024-06-05T13:36:38Z) - Physics-informed RL for Maximal Safety Probability Estimation [0.8287206589886881]
We study how to estimate the long-term safety probability of maximally safe actions without sufficient coverage of samples from risky states and long-term trajectories.
The proposed method can also estimate long-term risk using short-term samples and deduce the risk of unsampled states.
arXiv Detail & Related papers (2024-03-25T03:13:56Z) - ZigZag: Universal Sampling-free Uncertainty Estimation Through Two-Step Inference [54.17205151960878]
We introduce a sampling-free approach that is generic and easy to deploy.
We produce reliable uncertainty estimates on par with state-of-the-art methods at a significantly lower computational cost.
arXiv Detail & Related papers (2022-11-21T13:23:09Z) - Log Barriers for Safe Black-box Optimization with Application to Safe
Reinforcement Learning [72.97229770329214]
We introduce a general approach for seeking high dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach called LBSGD is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing violation in policy tasks in safe reinforcement learning.
arXiv Detail & Related papers (2022-07-21T11:14:47Z) - Holdouts set for safe predictive model updating [0.4499833362998489]
We propose using a holdout set' - a subset of the population that does not receive interventions guided by the risk score.<n>We show that, in order to minimise the number of pre-eclampsia cases over time, this is best achieved using a holdout set of around 10,000 individuals.
arXiv Detail & Related papers (2022-02-13T18:04:00Z) - Risk Minimization from Adaptively Collected Data: Guarantees for
Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z) - Quantifying Uncertainty in Deep Spatiotemporal Forecasting [67.77102283276409]
We describe two types of forecasting problems: regular grid-based and graph-based.
We analyze UQ methods from both the Bayesian and the frequentist point view, casting in a unified framework via statistical decision theory.
Through extensive experiments on real-world road network traffic, epidemics, and air quality forecasting tasks, we reveal the statistical computational trade-offs for different UQ methods.
arXiv Detail & Related papers (2021-05-25T14:35:46Z) - Certifying Neural Network Robustness to Random Input Noise from Samples [14.191310794366075]
Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings.
We propose a novel robustness certification method that upper bounds the probability of misclassification when the input noise follows an arbitrary probability distribution.
arXiv Detail & Related papers (2020-10-15T05:27:21Z) - SAMBA: Safe Model-Based & Active Reinforcement Learning [59.01424351231993]
SAMBA is a framework for safe reinforcement learning that combines aspects from probabilistic modelling, information theory, and statistics.
We evaluate our algorithm on a variety of safe dynamical system benchmarks involving both low and high-dimensional state representations.
We provide intuition as to the effectiveness of the framework by a detailed analysis of our active metrics and safety constraints.
arXiv Detail & Related papers (2020-06-12T10:40:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.