Related papers: Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs

URL: http://arxiv.org/abs/2506.13593v2
Date: Fri, 20 Jun 2025 12:12:17 GMT
Title: Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs
Authors: Hen Davidov, Gilad Freidkin, Shai Feldman, Yaniv Romano
Abstract summary: We develop a framework to quantify the time-to-unsafe-sampling - the number of large language model (LLM) generations required to trigger an unsafe (e.g., toxic) response. Our key innovation is designing an adaptive, per-prompt sampling strategy, formulated as a convex optimization problem.
Score: 14.568210512707603
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We develop a framework to quantify the time-to-unsafe-sampling - the number of large language model (LLM) generations required to trigger an unsafe (e.g., toxic) response. Estimating this quantity is challenging, since unsafe responses are exceedingly rare in well-aligned LLMs, potentially occurring only once in thousands of generations. As a result, directly estimating time-to-unsafe-sampling would require collecting training data with a prohibitively large number of generations per prompt. However, with realistic sampling budgets, we often cannot generate enough responses to observe an unsafe outcome for every prompt, leaving the time-to-unsafe-sampling unobserved in many cases, making the estimation and evaluation tasks particularly challenging. To address this, we frame this estimation problem as one of survival analysis and develop a provably calibrated lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt, leveraging recent advances in conformal prediction. Our key innovation is designing an adaptive, per-prompt sampling strategy, formulated as a convex optimization problem. The objective function guiding this optimized sampling allocation is designed to reduce the variance of the estimators used to construct the LPB, leading to improved statistical efficiency over naive methods that use a fixed sampling budget per prompt. Experiments on both synthetic and real data support our theoretical results and demonstrate the practical utility of our method for safety risk assessment in generative AI models.
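The censoring problem described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the authors' method: it assumes each generation is independently unsafe with a fixed probability, so that the time-to-unsafe-sampling is geometric. The names `sample_time_to_unsafe`, `censored_fraction`, `p_unsafe`, and `budget` are hypothetical and chosen only for illustration.

```python
import random


def sample_time_to_unsafe(p_unsafe, budget, rng):
    """Generate until the first unsafe response or until the budget runs out.

    Assumption: generations are i.i.d. and each is unsafe with probability
    p_unsafe. Returns (observed_time, event_observed); event_observed is
    False when the budget ended first, i.e. the true time is censored.
    """
    for t in range(1, budget + 1):
        if rng.random() < p_unsafe:
            return t, True
    return budget, False


def censored_fraction(p_unsafe, budget):
    """Probability that no unsafe response appears within the budget."""
    return (1.0 - p_unsafe) ** budget


rng = random.Random(0)
p_unsafe = 1e-3   # hypothetical rate: one unsafe response per thousand
budget = 200      # hypothetical fixed number of generations per prompt

results = [sample_time_to_unsafe(p_unsafe, budget, rng) for _ in range(2000)]
empirical = sum(1 for _, observed in results if not observed) / len(results)

print("theoretical censored fraction:", censored_fraction(p_unsafe, budget))
print("empirical censored fraction:  ", empirical)
```

Under these assumed values, most prompts never show an unsafe response within the budget, so only a lower bound on their time-to-unsafe-sampling is observed. This is the setting that survival analysis is designed to handle.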

Related papers

Efficient Uncertainty Estimation via Distillation of Bayesian Large Language Models [12.69571386421462]
In this paper, we investigate the possibility of eliminating the need for test-time sampling for uncertainty estimation. We distill an off-the-shelf Bayesian LLM into a non-Bayesian student LLM by minimizing the divergence between their predictive distributions. Our experiments demonstrate that uncertainty estimation capabilities on training data can successfully generalize to unseen test data.
arXiv Detail & Related papers (2025-05-16T22:26:03Z)
Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy. We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z)
Uncertainty Quantification for LLM-Based Survey Simulations [9.303339416902995]
We investigate the use of large language models (LLMs) to simulate human responses to survey questions. Our approach converts imperfect LLM-simulated responses into confidence sets for population parameters. A key innovation lies in determining the optimal number of simulated responses.
arXiv Detail & Related papers (2025-02-25T02:07:29Z)
Treatment of Statistical Estimation Problems in Randomized Smoothing for Adversarial Robustness [0.0]
We review the statistical estimation problems for randomized smoothing to find out if the computational burden is necessary. We present estimation procedures employing confidence sequences enjoying the same statistical guarantees as the standard methods. We provide a randomized version of Clopper-Pearson confidence intervals resulting in strictly stronger certificates.
arXiv Detail & Related papers (2024-06-25T14:00:55Z)
Relaxed Quantile Regression: Prediction Intervals for Asymmetric Noise [51.87307904567702]
Quantile regression is a leading approach for obtaining such intervals via the empirical estimation of quantiles in the distribution of outputs. We propose Relaxed Quantile Regression (RQR), a direct alternative to quantile regression based interval construction that removes this arbitrary constraint. We demonstrate that this added flexibility results in intervals with an improvement in desirable qualities.
arXiv Detail & Related papers (2024-06-05T13:36:38Z)
Physics-informed RL for Maximal Safety Probability Estimation [0.8287206589886881]
We study how to estimate the long-term safety probability of maximally safe actions without sufficient coverage of samples from risky states and long-term trajectories. The proposed method can also estimate long-term risk using short-term samples and deduce the risk of unsampled states.
arXiv Detail & Related papers (2024-03-25T03:13:56Z)
ZigZag: Universal Sampling-free Uncertainty Estimation Through Two-Step Inference [54.17205151960878]
We introduce a sampling-free approach that is generic and easy to deploy. We produce reliable uncertainty estimates on par with state-of-the-art methods at a significantly lower computational cost.
arXiv Detail & Related papers (2022-11-21T13:23:09Z)
Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning [72.97229770329214]
We introduce a general approach for high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial. Our approach, called LBSGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size. We demonstrate the effectiveness of our approach on minimizing violation in policy tasks in safe reinforcement learning.
arXiv Detail & Related papers (2022-07-21T11:14:47Z)
Holdout sets for safe predictive model updating [0.4499833362998489]
We propose using a 'holdout set' - a subset of the population that does not receive interventions guided by the risk score. We show that, in order to minimise the number of pre-eclampsia cases over time, this is best achieved using a holdout set of around 10,000 individuals.
arXiv Detail & Related papers (2022-02-13T18:04:00Z)
Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data. We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class. For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
Quantifying Uncertainty in Deep Spatiotemporal Forecasting [67.77102283276409]
We describe two types of forecasting problems: regular grid-based and graph-based. We analyze UQ methods from both the Bayesian and the frequentist point of view, casting them in a unified framework via statistical decision theory. Through extensive experiments on real-world road network traffic, epidemics, and air quality forecasting tasks, we reveal the statistical and computational trade-offs for different UQ methods.
arXiv Detail & Related papers (2021-05-25T14:35:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.