Related papers: Refusal in Language Models Is Mediated by a Single Direction

URL: http://arxiv.org/abs/2406.11717v3
Date: Wed, 30 Oct 2024 18:57:07 GMT
Title: Refusal in Language Models Is Mediated by a Single Direction
Authors: Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
Abstract summary: We show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. We propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities.
Score: 4.532520427311685
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
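The two interventions named in the abstract, erasing a direction from an activation and adding it back, can be illustrated with a minimal sketch on toy vectors. This is plain linear algebra under stated assumptions, not the authors' implementation: the function names (`unit`, `dot`, `ablate_direction`, `add_direction`), the `scale` parameter, and the example vectors are hypothetical, and this listing does not describe how the direction is found or how model activations are accessed.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate_direction(activation, direction):
    """Remove the component of `activation` along `direction`.

    Computes x' = x - (x . r) r, where r is the unit-normalized direction.
    """
    r = unit(direction)
    coeff = dot(activation, r)
    return [a - coeff * ri for a, ri in zip(activation, r)]


def add_direction(activation, direction, scale=1.0):
    """Shift `activation` along the unit-normalized direction by `scale` (hypothetical knob)."""
    r = unit(direction)
    return [a + scale * ri for a, ri in zip(activation, r)]


# Toy values only: a 3-dimensional "activation" and a direction along the second axis.
x = [3.0, 4.0, 1.0]
r = [0.0, 2.0, 0.0]

ablated = ablate_direction(x, r)         # component along r is removed
steered = add_direction(x, r, scale=2.0)  # component along r is increased by 2
```

After ablation the vector has zero projection onto the chosen direction while its other components are unchanged, which is the sense in which such an edit can be described as targeted.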

Related papers

Adversarial Manipulation of Reasoning Models using Internal Representations [1.024113475677323]
We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply. We show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models.
arXiv Detail & Related papers (2025-07-03T20:51:32Z)
Persona Features Control Emergent Misalignment [4.716981217776586]
We show that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment". We apply a "model diffing" approach to compare internal model representations before and after fine-tuning. We also investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
arXiv Detail & Related papers (2025-06-24T17:38:21Z)
Understanding Refusal in Language Models with Sparse Autoencoders [27.212781538459588]
We use sparse autoencoders to identify latent features that causally mediate refusal behaviors. We intervene on refusal-related features to assess their influence on generation. This enables a fine-grained inspection of how refusal manifests at the activation level.
arXiv Detail & Related papers (2025-05-29T15:33:39Z)
An Embarrassingly Simple Defense Against LLM Abliteration Attacks [46.74826882670651]
Large language models (LLMs) are typically aligned to comply with safety guidelines by refusing harmful instructions. A recent attack, termed abliteration, isolates and suppresses the single latent direction most responsible for refusal behavior. We propose a defense that modifies how models generate refusals.
arXiv Detail & Related papers (2025-05-25T09:18:24Z)
Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior [59.20260988638777]
We demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior. In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior.
arXiv Detail & Related papers (2025-03-22T23:35:49Z)
Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
TraSCE: Trajectory Steering for Concept Erasure [16.752023123940674]
Text-to-image diffusion models have been shown to generate harmful content such as not-safe-for-work (NSFW) images. We propose TraSCE, an approach to guide the diffusion trajectory away from generating harmful content. Our proposed method achieves state-of-the-art results on various benchmarks in removing harmful content.
arXiv Detail & Related papers (2024-12-10T16:45:03Z)
Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models [57.16056181201623]
Fine-tuning text-to-image diffusion models can inadvertently undo safety measures, causing models to relearn harmful concepts. We present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation modules separately from Fine-Tuning LoRA components. This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks.
arXiv Detail & Related papers (2024-11-30T04:37:38Z)
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation [29.605302471407537]
Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours. We propose a simple and surgical method for mitigating false refusal in language models via single vector ablation. Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.
arXiv Detail & Related papers (2024-10-04T13:25:32Z)
Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits. Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z)
Who's asking? User personas and the mechanics of latent misalignment [12.92431783194089]
Misaligned capabilities remain latent in safety-tuned models. We show that even when model generations are safe, harmful content can persist in hidden representations. We investigate why certain personas break model safeguards and find that they enable the model to form more charitable interpretations.
arXiv Detail & Related papers (2024-06-17T21:15:12Z)
Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping. We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups. We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z)
Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors behind overkill by exploring how models handle and determine the safety of queries. Our findings reveal shortcuts within models that lead to over-attention to harmful words like 'kill', and show that prompts emphasizing safety exacerbate overkill. We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
arXiv Detail & Related papers (2024-01-31T07:26:47Z)
Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis [127.85293480405082]
The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges. Existing alignment methods usually direct LLMs toward the favorable outcomes by utilizing human-annotated, flawless instruction-response pairs. This study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them.
arXiv Detail & Related papers (2023-10-16T14:59:10Z)
Fundamental Limitations of Alignment in Large Language Models [16.393916864600193]
An important aspect in developing language models that interact with humans is aligning their behavior to be useful and unharmful. This is usually achieved by tuning the model in a way that enhances desired behaviors and inhibits undesired ones. We propose a theoretical approach called Behavior Expectation Bounds (BEB) which allows us to formally investigate several inherent characteristics and limitations of alignment in large language models.
arXiv Detail & Related papers (2023-04-19T17:50:09Z)
MOVE: Effective and Harmless Ownership Verification via Embedded External Features [109.19238806106426]
We propose an effective and harmless model ownership verification (MOVE) to defend against different types of model stealing simultaneously. We conduct the ownership verification by verifying whether a suspicious model contains the knowledge of defender-specified external features. In particular, we develop our MOVE method under both white-box and black-box settings to provide comprehensive model protection.
arXiv Detail & Related papers (2022-08-04T02:22:29Z)
Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction. We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss. Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.