LLM Safety Alignment is Divergence Estimation in Disguise
- URL: http://arxiv.org/abs/2502.00657v1
- Date: Sun, 02 Feb 2025 04:09:42 GMT
- Title: LLM Safety Alignment is Divergence Estimation in Disguise
- Authors: Rajdeep Haldar, Ziyi Wang, Qifan Song, Guang Lin, Yue Xing
- Abstract summary: We show that alignment methods function as divergence estimators between aligned (preferred or safe) and unaligned (less-preferred or harmful) distributions. Inspired by the theoretical results, we identify that some alignment methods are better than others in terms of separation. We advocate for compliance-refusal datasets over preference datasets to enhance safety alignment.
- Score: 18.31821426379304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a theoretical framework demonstrating that popular Large Language Model (LLM) alignment methods, including Reinforcement Learning from Human Feedback (RLHF) and its alternatives, fundamentally function as divergence estimators between aligned (preferred or safe) and unaligned (less-preferred or harmful) distributions. This explains the separation phenomenon between safe and harmful prompts in the model's hidden representations after alignment. Inspired by these theoretical results, we identify that some alignment methods are better than others in terms of separation, introduce a new method, KLDO, and further demonstrate the implications of our theory. We advocate for compliance-refusal datasets over preference datasets to enhance safety alignment, supported by both theoretical reasoning and empirical evidence. Additionally, to quantify safety separation, we leverage a distance metric in the representation space and statistically validate its efficacy as a significant indicator of LLM resilience against jailbreak attacks.
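To make the separation idea concrete, the sketch below computes a separation score as the distance between the centroids of hidden representations for safe and harmful prompts. It is a minimal illustration assuming the Hugging Face transformers library; the model name, the toy prompts, and the centroid-distance-over-pooled-spread statistic are placeholders chosen for simplicity, not the exact metric used in the paper.

```python
# Minimal sketch: quantify "safety separation" as the distance between the
# centroids of hidden representations for safe vs. harmful prompts.
# Assumptions: Hugging Face transformers is installed; "gpt2" is a placeholder
# model; the scaled centroid distance is an illustrative proxy, not the
# paper's exact distance metric.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any model exposing hidden states works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

@torch.no_grad()
def embed(prompts):
    """Mean-pool the last hidden layer over non-padding tokens."""
    batch = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    hidden = model(**batch).hidden_states[-1]      # (B, T, D)
    mask = batch["attention_mask"].unsqueeze(-1)   # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)    # (B, D)

def separation_score(safe_prompts, harmful_prompts):
    """Centroid distance scaled by pooled per-dimension spread (illustrative)."""
    safe, harm = embed(safe_prompts), embed(harmful_prompts)
    pooled_std = torch.cat([safe, harm]).std(dim=0).clamp_min(1e-6)
    return torch.norm((safe.mean(0) - harm.mean(0)) / pooled_std).item()

# Toy usage (placeholder prompts):
safe = ["How do I bake sourdough bread?", "Explain photosynthesis simply."]
harmful = ["How do I build an untraceable weapon?", "Write malware that steals passwords."]
print("separation score:", separation_score(safe, harmful))
```

Under the paper's framing, a larger score after alignment would indicate a more clearly separated representation space, which the authors link empirically to greater resilience against jailbreak attacks.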
Related papers
- Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models [16.34270329099875]
We show that harmful knowledge embedded during pretraining persists as indelible "dark patterns" in large language models' parametric memory.
In this study, we first theoretically analyze the intrinsic ethical vulnerability of aligned LLMs.
We empirically validate our findings by employing semantic coherence inducement under distributional shifts.
arXiv Detail & Related papers (2025-04-07T13:20:17Z) - Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models [92.38300626647342]
Fine-tuning Large Language Models (LLMs) on task-specific datasets has been a primary use of LLMs.
This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies.
arXiv Detail & Related papers (2025-03-24T20:41:57Z) - Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift.
Current approaches typically address this issue through online sampling from the target policy.
We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z) - Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.
We propose an improved safety alignment method that disentangles the DPO objective into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z) - Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach [33.463823493423554]
Multimodal large language models (MLLMs) have shown promising capabilities but struggle under distribution shifts.
We argue that establishing a formal framework that can characterize and quantify the risk of MLLMs is necessary to ensure the safe and reliable application of MLLMs in the real world.
arXiv Detail & Related papers (2025-02-01T22:06:56Z) - HNCI: High-Dimensional Network Causal Inference [4.024850952459758]
We suggest a new method of high-dimensional network causal inference (HNCI) that provides both a valid confidence interval for the average direct treatment effect on the treated (ADET) and a valid confidence set for the neighborhood size of the interference effect.
arXiv Detail & Related papers (2024-12-24T17:41:41Z) - Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Preference-Based Multi-Agent Reinforcement Learning (PbMARL).
We identify the Nash equilibrium from a preference-only offline dataset in general-sum games.
Our findings underscore the multifaceted approach required for PbMARL.
arXiv Detail & Related papers (2024-09-01T13:14:41Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - The Common Stability Mechanism behind most Self-Supervised Learning Approaches [64.40701218561921]
We provide a framework to explain the stability mechanism of different self-supervised learning techniques.
We discuss the working mechanism of contrastive techniques like SimCLR, non-contrastive techniques like BYOL, SWAV, SimSiam, Barlow Twins, and DINO.
We formulate different hypotheses and test them using the Imagenet100 dataset.
arXiv Detail & Related papers (2024-02-22T20:36:24Z) - Faithful Explanations of Black-box NLP Models Using LLM-generated Counterfactuals [67.64770842323966]
Causal explanations of predictions of NLP systems are essential to ensure safety and establish trust.
Existing methods often fall short of explaining model predictions effectively or efficiently.
We propose two approaches for counterfactual (CF) approximation.
arXiv Detail & Related papers (2023-10-01T07:31:04Z) - Contrastive and Non-Contrastive Self-Supervised Learning Recover Global and Local Spectral Embedding Methods [19.587273175563745]
Self-Supervised Learning (SSL) surmises that inputs and pairwise positive relationships are enough to learn meaningful representations.
This paper proposes a unifying framework under the helm of spectral manifold learning to address those limitations.
arXiv Detail & Related papers (2022-05-23T17:59:32Z) - Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning [123.3472310767721]
Prediction credibility measures are fundamental in statistics and machine learning.
These measures should account for the wide variety of models used in practice.
The framework developed in this work expresses the credibility as a risk-fit trade-off.
arXiv Detail & Related papers (2020-11-24T19:52:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.