Prompt Risk Control: A Rigorous Framework for Responsible Deployment of Large Language Models
- URL: http://arxiv.org/abs/2311.13628v2
- Date: Wed, 27 Mar 2024 20:20:22 GMT
- Title: Prompt Risk Control: A Rigorous Framework for Responsible Deployment of Large Language Models
- Authors: Thomas P. Zollo, Todd Morrill, Zhun Deng, Jake C. Snell, Toniann Pitassi, Richard Zemel
- Abstract summary: We propose Prompt Risk Control, a framework for selecting a prompt based on rigorous upper bounds on families of informative risk measures.
We offer methods for producing bounds on a diverse set of metrics, including quantities that measure worst-case responses.
Experiments on applications such as open-ended chat, medical question summarization, and code generation highlight how such a framework can foster responsible deployment.
- Score: 14.457388258269697
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent explosion in the capabilities of large language models has led to a wave of interest in how best to prompt a model to perform a given task. While it may be tempting to simply choose a prompt based on average performance on a validation set, this can lead to a deployment where unexpectedly poor responses are generated, especially for the worst-off users. To mitigate this prospect, we propose Prompt Risk Control, a lightweight framework for selecting a prompt based on rigorous upper bounds on families of informative risk measures. We offer methods for producing bounds on a diverse set of metrics, including quantities that measure worst-case responses and disparities in generation quality across the population of users. In addition, we extend the underlying statistical bounding techniques to accommodate the possibility of distribution shifts in deployment. Experiments on applications such as open-ended chat, medical question summarization, and code generation highlight how such a framework can foster responsible deployment by reducing the risk of the worst outcomes.
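The selection rule at the heart of the framework can be illustrated with a minimal, hypothetical sketch (not the authors' implementation): score each candidate prompt on a validation set, then pick the prompt with the smallest distribution-free upper bound on its mean loss rather than the smallest average loss. A simple Hoeffding bound, which assumes per-example losses lie in [0, 1], stands in here for the paper's family of risk bounds; the function names are illustrative.

```python
import math

def hoeffding_upper_bound(losses, delta=0.05):
    """Distribution-free upper bound on the true mean loss that holds with
    probability at least 1 - delta, assuming each loss lies in [0, 1]."""
    n = len(losses)
    mean = sum(losses) / n
    return mean + math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def select_prompt(loss_table, delta=0.05):
    """Pick the prompt whose certified upper bound on risk is smallest,
    rather than the one with the best average validation score."""
    return min(loss_table, key=lambda p: hoeffding_upper_bound(loss_table[p], delta))
```

Note how the bound penalizes prompts evaluated on few examples: a prompt with a slightly better average but a small validation sample can lose to a prompt whose risk is certified on more data.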
Related papers
- Conformal Thinking: Risk Control for Reasoning on a Compute Budget [60.65072883773352]
Reasoning Large Language Models (LLMs) enable test-time scaling, with dataset-level accuracy improving as the token budget increases.
We re-frame the budget setting problem as risk control, limiting the error rate while minimizing compute.
Our framework introduces an upper threshold that stops reasoning when the model is confident and a novel lower threshold that preemptively stops unsolvable instances.
arXiv Detail & Related papers (2026-02-03T18:17:22Z)
- YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models [36.084240131323824]
We present YuFeng-XGuard, a reasoning-centric guardrail model family for large language models (LLMs).
Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and confidence scores.
We introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining.
arXiv Detail & Related papers (2026-01-22T02:23:18Z)
- MultiRisk: Multiple Risk Control via Iterative Score Thresholding [40.193623095603265]
We formalize the problem of enforcing multiple risk constraints with user-defined priorities.
We introduce two efficient dynamic programming algorithms that leverage this sequential structure.
We show that our algorithm can control each individual risk at close to the target level.
arXiv Detail & Related papers (2025-12-31T03:25:30Z)
- Learning to Extract Context for Context-Aware LLM Inference [60.376872353918394]
User prompts to large language models (LLMs) are often ambiguous or under-specified.
Contextual cues shaped by user intentions, prior knowledge, and risk factors influence what constitutes an appropriate response.
We propose a framework that extracts and leverages such contextual information from the user prompt itself.
arXiv Detail & Related papers (2025-12-12T19:10:08Z)
- Quantifying Risks in Multi-turn Conversation with Large Language Models [19.530181302068232]
Large Language Models (LLMs) can produce catastrophic responses in conversational settings that pose serious risks to public safety and security.
We propose QRLLM, a principled certification framework for catastrophic risks in multi-turn conversation for LLMs.
arXiv Detail & Related papers (2025-10-04T23:00:40Z)
- SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents [58.21223208538351]
This work explores the security issues surrounding mobile multimodal agents.
It attempts to construct a risk discrimination mechanism by incorporating behavioral sequence information.
It also designs an automated assisted assessment scheme based on a large language model.
arXiv Detail & Related papers (2025-07-01T15:10:00Z)
- COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees [51.5976496056012]
COIN is an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question.
COIN estimates the empirical error rate on a calibration set and applies confidence interval methods to establish a high-probability upper bound on the true error rate.
We demonstrate COIN's robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data.
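COIN's exact calibration procedure is described in the paper; one standard way to turn a calibration-set error count into a high-probability upper bound on the true error rate is the exact Clopper-Pearson bound, sketched here purely for illustration (the function names and the bisection depth are my own choices).

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson_upper(errors, n, delta=0.05):
    """Exact (Clopper-Pearson) upper confidence bound on the true error
    rate, given `errors` mistakes observed on `n` calibration examples.
    Found by bisection: the bound is the largest p whose binomial CDF at
    the observed count still exceeds delta."""
    lo, hi = errors / n, 1.0
    for _ in range(60):  # 60 halvings is far below float precision
        mid = (lo + hi) / 2
        if binom_cdf(errors, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi
```

With zero observed errors on n examples this reduces to the familiar "rule of three"-style bound 1 - delta**(1/n).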
arXiv Detail & Related papers (2025-06-25T07:04:49Z)
- Risk-aware Direct Preference Optimization under Nested Risk Measure [23.336246526648374]
Risk-aware Direct Preference Optimization (Ra-DPO) is a novel approach that incorporates risk-awareness by employing a class of nested risk measures.
Experimental results across three open-source datasets demonstrate the proposed method's superior performance in balancing alignment performance and model drift.
arXiv Detail & Related papers (2025-05-26T08:01:37Z)
- A Proposal for Evaluating the Operational Risk for ChatBots based on Large Language Models [39.58317527488534]
We propose a novel, instrumented risk-assessment metric that simultaneously evaluates potential threats to three key stakeholders.
To validate our metric, we leverage Garak, an open-source framework for vulnerability testing.
Results underscore the importance of multi-dimensional risk assessments in operationalizing secure, reliable AI-driven conversational systems.
arXiv Detail & Related papers (2025-05-07T20:26:45Z)
- Uncertainty-Aware Decoding with Minimum Bayes Risk [70.6645260214115]
We show how Minimum Bayes Risk decoding, which selects model generations according to an expected risk, can be generalized into a principled uncertainty-aware decoding method.
We show that this modified expected risk is useful for both choosing outputs and deciding when to abstain from generation and can provide improvements without incurring overhead.
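Minimum Bayes Risk decoding itself is a standard procedure: sample several generations, then return the one with the highest expected similarity to the rest. A self-contained toy sketch, with a simple token-overlap F1 standing in for the utility function (the paper's modified expected risk is more involved), looks like:

```python
def mbr_select(candidates, similarity):
    """Minimum Bayes Risk decoding: among sampled generations, return the
    one with the highest average similarity to the others, i.e. the
    lowest expected risk under a similarity-based loss."""
    n = len(candidates)
    def expected_sim(i):
        return sum(similarity(candidates[i], candidates[j])
                   for j in range(n) if j != i) / (n - 1)
    return candidates[max(range(n), key=expected_sim)]

def token_f1(a, b):
    """Toy utility: F1 overlap between the two token sets."""
    ta, tb = set(a.split()), set(b.split())
    inter = len(ta & tb)
    if inter == 0:
        return 0.0
    p, r = inter / len(ta), inter / len(tb)
    return 2 * p * r / (p + r)
```

The uncertainty-aware extension described above can then, for example, abstain when even the best candidate's expected similarity falls below a threshold.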
arXiv Detail & Related papers (2025-03-07T10:55:12Z)
- Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models [63.559461750135334]
Language models (LMs) are increasingly used to build agents that can act autonomously to achieve goals.
We study this "answer-or-defer" problem with an evaluation framework that systematically varies human-specified risk structures.
We find that a simple skill-decomposition method, which isolates the independent skills required for answer-or-defer decision making, can consistently improve LMs' decision policies.
arXiv Detail & Related papers (2025-03-03T09:16:26Z)
- Conformal Tail Risk Control for Large Language Model Alignment [9.69785515652571]
General-purpose scoring models have been created to automate the process of quantifying tail events.
This phenomenon introduces potential human-machine misalignment between the respective scoring mechanisms.
We present a lightweight calibration framework for blackbox models that ensures the alignment of humans and machines with provable guarantees.
arXiv Detail & Related papers (2025-02-27T17:10:54Z)
- Forecasting Rare Language Model Behaviors [20.712406244928832]
We introduce a method to forecast potential risks across orders of magnitude more queries than we test during evaluation.
We find that our forecasts can predict the emergence of diverse undesirable behaviors across up to three orders of magnitude of query volume.
Our work enables model developers to proactively anticipate and patch rare failures before they manifest during large-scale deployments.
arXiv Detail & Related papers (2025-02-24T03:16:15Z)
- Improved Compression Bounds for Scenario Decision Making [0.7673339435080445]
We show how to make a decision in an uncertain environment by drawing samples of the uncertainty and making a decision based on the samples, called "scenarios".
Probability guarantees take the form of a bound on the probability of sampling a set of scenarios that will lead to a decision whose risk of failure is above a given maximum tolerance.
We propose new bounds that improve upon the existing ones without requiring stronger assumptions on the problem.
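For context, the classical scenario-approach guarantee that such work improves upon has a closed form: for a convex program with d decision variables solved on N sampled scenarios, the probability of drawing a sample whose optimal decision has violation probability above eps is bounded by a binomial tail. The sketch below computes that classical bound; it is background, not the paper's new bound.

```python
import math

def scenario_bound(N, d, eps):
    """Classical scenario-approach guarantee: probability of sampling N
    scenarios whose optimal decision (d decision variables, convex
    program) violates a fresh scenario with probability more than eps.
    This is the binomial tail sum_{i<d} C(N,i) eps^i (1-eps)^(N-i)."""
    return sum(math.comb(N, i) * eps**i * (1 - eps) ** (N - i) for i in range(d))
```

For d = 1 this collapses to (1 - eps)**N, and the bound shrinks as more scenarios are drawn, which is how one sizes N for a target confidence.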
arXiv Detail & Related papers (2025-01-15T15:53:34Z)
- On the Privacy Risk of In-context Learning [36.633860818454984]
We show that deploying prompted models presents a significant privacy risk for the data used within the prompt.
We also observe that the privacy risk of prompted models exceeds that of fine-tuned models at the same utility levels.
arXiv Detail & Related papers (2024-11-15T17:11:42Z)
- Data-driven decision-making under uncertainty with entropic risk measure [5.407319151576265]
The entropic risk measure is widely used in high-stakes decision making to account for tail risks associated with an uncertain loss.
To debias the empirical entropic risk estimator, we propose a strongly consistent bootstrapping procedure.
We show that cross validation methods can result in significantly higher out-of-sample risk for the insurer if the bias in validation performance is not corrected for.
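The entropic risk measure referenced above has a standard closed form, rho_theta(X) = (1/theta) * log E[exp(theta * X)]. A numerically stable plug-in estimator can be sketched as follows; the bias this naive estimator carries in small samples is exactly what the paper's bootstrapping procedure corrects.

```python
import math

def entropic_risk(losses, theta):
    """Empirical entropic risk: (1/theta) * log( mean(exp(theta * x)) ).
    For theta > 0 it upweights tail losses, and by Jensen's inequality it
    always upper-bounds the plain mean. Uses the log-sum-exp trick so
    large theta * x values do not overflow."""
    n = len(losses)
    m = max(theta * x for x in losses)
    return (m + math.log(sum(math.exp(theta * x - m) for x in losses) / n)) / theta
```

The same concavity of log that makes the measure well-behaved is also why the plug-in estimate is biased downward on finite samples.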
arXiv Detail & Related papers (2024-09-30T04:02:52Z)
- Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework [77.45983464131977]
We focus on how likely it is that a RAG model's prediction is incorrect, resulting in uncontrollable risks in real-world applications.
Our research identifies two critical latent factors affecting RAG's confidence in its predictions.
We develop a counterfactual prompting framework that induces the models to alter these factors and analyzes the effect on their answers.
arXiv Detail & Related papers (2024-09-24T14:52:14Z)
- Risks and NLP Design: A Case Study on Procedural Document QA [52.557503571760215]
We argue that clearer assessments of risks and harms to users will be possible when we specialize the analysis to more concrete applications and their plausible users.
We conduct a risk-oriented error analysis that could then inform the design of a future system to be deployed with lower risk of harm and better performance.
arXiv Detail & Related papers (2024-08-16T17:23:43Z)
- Data-Adaptive Tradeoffs among Multiple Risks in Distribution-Free Prediction [55.77015419028725]
We develop methods that permit valid control of risk when threshold and tradeoff parameters are chosen adaptively.
Our methodology supports monotone and nearly-monotone risks, but otherwise makes no distributional assumptions.
arXiv Detail & Related papers (2024-03-28T17:28:06Z)
- DPP-Based Adversarial Prompt Searching for Language Models [56.73828162194457]
Auto-regressive Selective Replacement Ascent (ASRA) is a discrete optimization algorithm that selects prompts based on both quality and similarity via a determinantal point process (DPP).
Experimental results on six different pre-trained language models demonstrate the efficacy of ASRA for eliciting toxic content.
arXiv Detail & Related papers (2024-03-01T05:28:06Z)
- Modeling the Q-Diversity in a Min-max Play Game for Robust Optimization [61.39201891894024]
Group distributionally robust optimization (group DRO) can minimize the worst-case loss over pre-defined groups.
We reformulate the group DRO framework by proposing Q-Diversity.
Characterized by an interactive training mode, Q-Diversity relaxes the group identification from annotation into direct parameterization.
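Q-Diversity builds on group DRO, whose standard online recipe maintains a weight per group and upweights whichever group is currently losing worst via a multiplicative-weights update; this is the piece whose fixed group annotation Q-Diversity relaxes into a learned parameterization. A sketch of that classic update (the step size and function name are illustrative):

```python
import math

def group_dro_weights(group_losses, weights, step=0.1):
    """One multiplicative-weights update from the standard online
    group-DRO recipe: each group's weight is scaled by exp(step * loss)
    and renormalized, so the objective tracks the worst-case group."""
    new = [w * math.exp(step * l) for w, l in zip(weights, group_losses)]
    z = sum(new)
    return [w / z for w in new]
```

The model is then trained on the weighted sum of group losses, which concentrates gradient signal on the groups it currently serves worst.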
arXiv Detail & Related papers (2023-05-20T07:02:27Z)
- R-U-SURE? Uncertainty-Aware Code Suggestions By Maximizing Utility Across Random User Intents [14.455036827804541]
Large language models show impressive results at predicting structured text such as code, but also commonly introduce errors and hallucinations in their output.
We propose Randomized Utility-driven Synthesis of Uncertain REgions (R-U-SURE).
R-U-SURE is an approach for building uncertainty-aware suggestions based on a decision-theoretic model of goal-conditioned utility.
arXiv Detail & Related papers (2023-03-01T18:46:40Z)
- Sample-Based Bounds for Coherent Risk Measures: Applications to Policy Synthesis and Verification [32.9142708692264]
This paper aims to address a few problems regarding risk-aware verification and policy synthesis.
First, we develop a sample-based method to evaluate a subset of a random variable distribution.
Second, we develop a sample-based method to determine solutions to problems that outperform a large fraction of the decision space.
arXiv Detail & Related papers (2022-04-21T01:06:10Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- Fast Risk Assessment for Autonomous Vehicles Using Learned Models of Agent Futures [10.358493658420173]
This paper presents fast non-sampling based methods to assess the risk of trajectories for autonomous vehicles.
The presented methods address a wide range of representations for uncertain predictions including both Gaussian and non-Gaussian mixture models.
The presented methods are demonstrated on realistic predictions from models trained on the Argoverse and CARLA datasets.
arXiv Detail & Related papers (2020-05-27T16:16:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.