Consensus Sampling for Safer Generative AI
- URL: http://arxiv.org/abs/2511.09493v1
- Date: Thu, 13 Nov 2025 01:58:09 GMT
- Title: Consensus Sampling for Safer Generative AI
- Authors: Adam Tauman Kalai, Yael Tauman Kalai, Or Zamir
- Abstract summary: Many approaches to AI safety rely on inspecting model outputs or activations, yet certain risks are inherently undetectable by inspection alone. We propose a complementary, architecture-agnostic approach that enhances safety through the aggregation of multiple generative models. We present a consensus sampling algorithm that, given $k$ models and a prompt, achieves risk competitive with the average risk of the safest $s$ of the $k$ models.
- Score: 8.93965818386567
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many approaches to AI safety rely on inspecting model outputs or activations, yet certain risks are inherently undetectable by inspection alone. We propose a complementary, architecture-agnostic approach that enhances safety through the aggregation of multiple generative models, with the aggregated model inheriting its safety from the safest subset of a given size among them. Specifically, we present a consensus sampling algorithm that, given $k$ models and a prompt, achieves risk competitive with the average risk of the safest $s$ of the $k$ models, where $s$ is a chosen parameter, while abstaining when there is insufficient agreement between them. The approach leverages the models' ability to compute output probabilities, and we bound the probability of abstention when sufficiently many models are safe and exhibit adequate agreement. The algorithm is inspired by the provable copyright protection algorithm of Vyas et al. (2023). It requires some overlap among safe models, offers no protection when all models are unsafe, and may accumulate risk over repeated use. Nonetheless, our results provide a new, model-agnostic approach for AI safety by amplifying safety guarantees from an unknown subset of models within a collection to that of a single reliable model.
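The abstract states the guarantee but not the sampling rule itself. Since the algorithm reportedly leverages the models' output probabilities and is inspired by the rejection-sampling scheme of Vyas et al. (2023), the minimal Python sketch below shows one plausible realization; the `Model` interface (`sample`/`prob`), the uniform-mixture proposal, the safest-$s$ acceptance ratio, and the round budget are all illustrative assumptions, not the paper's exact algorithm.

```python
import random

def consensus_sample(models, prompt, s, max_rounds=50):
    """Hedged sketch of consensus sampling (NOT the paper's exact rule).

    Each model is assumed to expose sample(prompt) -> y and
    prob(prompt, y) -> float. We target the average of the s smallest
    per-model probabilities of y (a proxy for "outputs the safest s
    models agree on"), propose from the uniform mixture of all k models,
    and accept by rejection sampling; None signals abstention.
    """
    k = len(models)
    assert 1 <= s <= k
    for _ in range(max_rounds):
        y = random.choice(models).sample(prompt)      # draw from the mixture
        probs = sorted(m.prob(prompt, y) for m in models)
        target = sum(probs[:s]) / s                   # avg of s smallest probs
        mixture = sum(probs) / k                      # mixture density at y
        # target <= mixture always, so this is a valid acceptance probability
        if mixture > 0 and random.random() < target / mixture:
            return y
    return None                                       # abstain: too little agreement
```

Because the average of the $s$ smallest probabilities never exceeds the mixture density, accepted outputs follow the normalized consensus distribution, and abstention becomes rare precisely when enough models place similar probability mass on the same outputs, loosely mirroring the abstention bound described in the abstract.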
Related papers
- ProGuard: Towards Proactive Multimodal Safeguard [48.89789547707647]
ProGuard is a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks.
We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories.
We then train our vision-language base model purely through reinforcement learning to achieve efficient and concise reasoning.
arXiv Detail & Related papers (2025-12-29T16:13:23Z)
- PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach [49.14349403242654]
We present $\textbf{PropensityBench}$, a novel benchmark framework that assesses the proclivity of models to engage in risky behaviors.
Our framework includes 5,874 scenarios with 6,648 tools spanning four high-risk domains: cybersecurity, self-proliferation, biosecurity, and chemical security.
Across open-source and proprietary frontier models, we uncover alarming signs of propensity: models frequently choose high-risk tools when under pressure.
arXiv Detail & Related papers (2025-11-24T18:46:44Z)
- Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs [19.045128057653784]
We introduce time-to-unsafe-sampling, a novel safety measure for generative models.
Unsafe outputs are often rare in well-aligned models and thus may not be observed under any feasible sampling budget.
We propose a novel calibration technique to construct a lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt with rigorous coverage guarantees.
arXiv Detail & Related papers (2025-06-16T15:21:25Z)
- Counterfactual Explanations for Model Ensembles Using Entropic Risk Measures [7.959080260803575]
Counterfactual explanations indicate the smallest change in input that can translate to a different outcome for a machine learning model.
We propose a novel strategy to find the counterfactual for an ensemble of models using the perspective of entropic risk measure (see the entropic-risk sketch after this list).
We study the trade-off between the cost (effort) of the counterfactual and its validity for an ensemble by varying degrees of risk aversion.
arXiv Detail & Related papers (2025-03-11T00:25:28Z)
- Uncertainty-Aware Decoding with Minimum Bayes Risk [70.6645260214115]
We show how Minimum Bayes Risk (MBR) decoding, which selects model generations according to an expected risk, can be generalized into a principled uncertainty-aware decoding method (a minimal MBR sketch follows this list).
We show that this modified expected risk is useful both for choosing outputs and for deciding when to abstain from generation, and can provide improvements without incurring overhead.
arXiv Detail & Related papers (2025-03-07T10:55:12Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- PROSAC: Provably Safe Certification for Machine Learning Models under Adversarial Attacks [22.30471086955775]
State-of-the-art machine learning models can be seriously compromised by adversarial perturbations.
We propose a new approach to certify the performance of machine learning models in the presence of adversarial attacks.
arXiv Detail & Related papers (2024-02-04T22:45:20Z)
- Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts for diffusion models together with the corresponding generation of inappropriate content.
Our results show that, by manipulating safe prompting benchmarks, Ring-A-Bell can transform prompts originally regarded as safe so that they evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z)
- Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z)
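As referenced in the counterfactual-explanations entry above, the entropic risk measure is a standard functional that is easy to make concrete: for a loss $L$ and risk-aversion parameter $\theta > 0$, $\rho_\theta(L) = \tfrac{1}{\theta}\log \mathbb{E}[e^{\theta L}]$. The helper below is a hedged sketch of that measure over an ensemble's per-model losses (uniform weights assumed); how the paper embeds it in the counterfactual search is not reproduced here.

```python
import math

def entropic_risk(losses, theta):
    """Entropic risk rho_theta(L) = (1/theta) * log E[exp(theta * L)].

    `losses` are per-model losses across the ensemble (uniform weights
    assumed); larger theta is more risk-averse, weighting the worst
    models more heavily. Computed via log-sum-exp for stability.
    """
    assert theta > 0 and losses
    m = max(theta * l for l in losses)
    lse = m + math.log(sum(math.exp(theta * l - m) for l in losses) / len(losses))
    return lse / theta

# theta -> 0 recovers the mean loss; large theta approaches the worst model
print(entropic_risk([0.1, 0.2, 0.9], theta=0.01))  # close to the mean, ~0.4
print(entropic_risk([0.1, 0.2, 0.9], theta=50.0))  # close to the max, ~0.9
```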
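Likewise, the MBR sketch promised in the uncertainty-aware decoding entry: vanilla Minimum Bayes Risk decoding samples candidates, estimates each candidate's expected risk against the other candidates used as pseudo-references, and returns the minimizer. The paper's uncertainty-aware generalization and abstention rule are not reproduced here; `risk` is any user-supplied pairwise metric (e.g. 1 - BLEU).

```python
def mbr_decode(candidates, risk):
    """Vanilla Minimum Bayes Risk decoding over sampled candidates.

    risk(y, y_ref) scores how bad output y is when y_ref is treated as
    a pseudo-reference. Each candidate's expected risk is estimated by
    averaging against all other candidates; the lowest-risk one wins.
    """
    def expected_risk(i):
        y = candidates[i]
        refs = [c for j, c in enumerate(candidates) if j != i]
        return sum(risk(y, ref) for ref in refs) / max(len(refs), 1)

    best = min(range(len(candidates)), key=expected_risk)
    return candidates[best]
```

An abstention rule in the spirit of that entry (our assumption, not the paper's) could return None whenever even the minimizer's expected risk exceeds a threshold.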