Conformal Tail Risk Control for Large Language Model Alignment
- URL: http://arxiv.org/abs/2502.20285v1
- Date: Thu, 27 Feb 2025 17:10:54 GMT
- Title: Conformal Tail Risk Control for Large Language Model Alignment
- Authors: Catherine Yu-Chi Chen, Jingyan Shen, Zhun Deng, Lihua Lei
- Abstract summary: General-purpose scoring models have been created to automate the process of quantifying tail events. This phenomenon introduces potential human-machine misalignment between the respective scoring mechanisms. We present a lightweight calibration framework for black-box models that ensures the alignment of humans and machines with provable guarantees.
- Score: 9.69785515652571
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent developments in large language models (LLMs) have led to their widespread use for various tasks. The prevalence of LLMs in society calls for assurance of the reliability of their performance. In particular, risk-sensitive applications demand meticulous attention to unexpectedly poor outcomes, i.e., tail events such as toxic answers, humiliating language, and offensive outputs. Because acquiring human annotations is costly, general-purpose scoring models have been created to automate the quantification of these tail events. This introduces potential human-machine misalignment between the respective scoring mechanisms. In this work, we present a lightweight calibration framework for black-box models that ensures the alignment of humans and machines with provable guarantees. Our framework provides a rigorous approach to controlling, with high confidence, any distortion risk measure that is characterized by a weighted average of quantiles of the loss incurred by the LLM. The theoretical foundation of our method relies on the connection between conformal risk control and a traditional family of statistics, namely L-statistics. To demonstrate the utility of our framework, we conduct comprehensive experiments that address the issue of human-machine misalignment.
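The abstract characterizes the controlled quantity, a distortion risk measure, as a weighted average of quantiles of the LLM's loss, i.e., an L-statistic. Below is a minimal illustrative sketch of that plug-in idea, not the authors' implementation; the function names, the CVaR weighting, and the simulated losses are assumptions made for illustration. The paper's conformal procedure additionally provides a high-confidence bound on such a measure, which this sketch does not reproduce.

import numpy as np

def distortion_risk(losses: np.ndarray, weights: np.ndarray) -> float:
    """Plug-in L-statistic: a weighted average of the sorted losses (order statistics)."""
    order_stats = np.sort(losses)
    return float(np.dot(weights, order_stats))

def cvar_weights(n: int, alpha: float = 0.9) -> np.ndarray:
    """Weights under which the L-statistic approximates CVaR at level alpha:
    uniform mass on the worst (1 - alpha) fraction of the order statistics."""
    w = np.zeros(n)
    k = int(np.ceil(alpha * n))   # first index of the upper tail
    w[k:] = 1.0 / (n - k)         # average of the worst (1 - alpha) losses
    return w

# Hypothetical human-annotated toxicity losses on a calibration set of LLM outputs.
rng = np.random.default_rng(0)
losses = rng.exponential(scale=1.0, size=500)
print(distortion_risk(losses, cvar_weights(len(losses), alpha=0.9)))

Other distortions (e.g., plain quantiles or trimmed means) correspond to different weight vectors over the same order statistics, which is what makes the L-statistic view convenient for risk control.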
Related papers
- Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction [0.0]
We propose a model-agnostic uncertainty quantification method that integrates dynamic threshold calibration and cross-modal consistency verification.
We show that the framework achieves stable performance across varying calibration-to-test split ratios, underscoring its robustness for real-world deployment in healthcare, autonomous systems, and other safety-sensitive domains.
This work bridges the gap between theoretical reliability and practical applicability in multi-modal AI systems, offering a scalable solution for hallucination detection and uncertainty-aware decision-making.
arXiv Detail & Related papers (2025-04-24T15:39:46Z) - Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge [0.0]
Large Language Models (LLMs) have revolutionized artificial intelligence, driving advancements in machine translation, summarization, and conversational agents.
Recent studies indicate that LLMs remain vulnerable to adversarial attacks designed to elicit biased responses.
This work proposes a scalable benchmarking framework to evaluate LLM robustness against adversarial bias elicitation.
arXiv Detail & Related papers (2025-04-10T16:00:59Z) - Epistemic Integrity in Large Language Models [11.173637560124828]
Large language models are increasingly relied upon as sources of information, but their propensity for false or misleading statements poses high risks for users and society.
In this paper, we confront the critical problem of miscalibration where a model's linguistic assertiveness fails to reflect its true internal certainty.
We introduce a new human misalignment evaluation and a novel method for measuring the linguistic assertiveness of Large Language Models.
arXiv Detail & Related papers (2024-11-10T17:10:13Z) - Quantifying Prediction Consistency Under Model Multiplicity in Tabular LLMs [10.494477811252034]
Fine-tuning large language models can lead to fine-tuning multiplicity, where equally well-performing models make conflicting predictions on the same inputs.
This raises critical concerns about the robustness and reliability of Tabular LLMs.
This work proposes a novel metric to quantify the robustness of individual predictions without expensive model retraining.
arXiv Detail & Related papers (2024-07-04T22:22:09Z) - Jailbreaking as a Reward Misspecification Problem [80.52431374743998]
We propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness. We present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space.
arXiv Detail & Related papers (2024-06-20T15:12:27Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Exploiting LLM Quantization [6.506984021742173]
Quantization is a technique to reduce the memory usage of large language models.
We show that widely used quantization methods can be exploited to produce a harmful quantized LLM.
In practice, the adversary could host the resulting full-precision model on an LLM community hub such as Hugging Face.
arXiv Detail & Related papers (2024-05-28T12:51:01Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in the common belief that base LLMs, lacking instruction tuning, pose minimal risk of misuse.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Analyzing Adversarial Inputs in Deep Reinforcement Learning [53.3760591018817]
We present a comprehensive analysis of the characterization of adversarial inputs, through the lens of formal verification.
We introduce a novel metric, the Adversarial Rate, to classify models based on their susceptibility to adversarial perturbations.
Our analysis empirically demonstrates how adversarial inputs can compromise the safety of a given deep reinforcement learning (DRL) system.
arXiv Detail & Related papers (2024-02-07T21:58:40Z) - Statistical inference for individual fairness [24.622418924551315]
We focus on the problem of detecting violations of individual fairness in machine learning models.
We develop a suite of inference tools for the adversarial cost function.
We demonstrate the utility of our tools in a real-world case study.
arXiv Detail & Related papers (2021-03-30T22:49:25Z) - Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning [123.3472310767721]
Prediction credibility measures are fundamental in statistics and machine learning.
These measures should account for the wide variety of models used in practice.
The framework developed in this work expresses the credibility as a risk-fit trade-off.
arXiv Detail & Related papers (2020-11-24T19:52:38Z) - An Uncertainty-based Human-in-the-loop System for Industrial Tool Wear Analysis [68.8204255655161]
We show that uncertainty measures based on Monte-Carlo dropout in the context of a human-in-the-loop system increase the system's transparency and performance.
A simulation study demonstrates that the uncertainty-based human-in-the-loop system increases performance for different levels of human involvement.
arXiv Detail & Related papers (2020-07-14T15:47:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.