There Is More to Refusal in Large Language Models than a Single Direction
- URL: http://arxiv.org/abs/2602.02132v1
- Date: Mon, 02 Feb 2026 14:15:44 GMT
- Title: There Is More to Refusal in Large Language Models than a Single Direction
- Authors: Faaiz Joad, Majd Hawasly, Sabri Boughorbel, Nadir Durrani, Husrev Taha Sencar
- Abstract summary: Prior work argues that refusal in large language models is mediated by a single activation-space direction; we show this account is incomplete. Across eleven categories of refusal and non-compliance, we find that these refusal behaviors correspond to geometrically distinct directions in activation space.
- Score: 10.766705737230781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior work argues that refusal in large language models is mediated by a single activation-space direction, enabling effective steering and ablation. We show that this account is incomplete. Across eleven categories of refusal and non-compliance, including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal, we find that these refusal behaviors correspond to geometrically distinct directions in activation space. Yet despite this diversity, linear steering along any refusal-related direction produces nearly identical refusal to over-refusal trade-offs, acting as a shared one-dimensional control knob. The primary effect of different directions is not whether the model refuses, but how it refuses.
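The linear steering the abstract refers to is commonly implemented as a difference-in-means direction added to residual-stream activations. The sketch below illustrates that recipe on toy data; the function names, hidden size, and random "activations" are hypothetical stand-ins, not the paper's actual pipeline.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Difference-in-means direction between mean activations on
    refused (harmful) and complied (harmless) prompts, unit-normalized."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(activation, direction, alpha):
    """Shift a residual-stream activation along `direction`.
    alpha > 0 pushes toward refusal, alpha < 0 away from it."""
    return activation + alpha * direction

# Toy example with random stand-in "activations" of hidden size 8.
rng = np.random.default_rng(0)
harmful = rng.normal(1.0, 0.1, size=(16, 8))   # prompts the model refused
harmless = rng.normal(0.0, 0.1, size=(16, 8))  # prompts the model answered
d = refusal_direction(harmful, harmless)
x = rng.normal(size=8)
x_steered = steer(x, d, alpha=2.0)
```

The paper's point is that many category-specific directions, plugged into `steer` like this, trace out nearly the same refusal/over-refusal trade-off curve as `alpha` varies.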
Related papers
- Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics [2.4839105527363574]
We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour. We show that it can remove political refusal behaviour while retaining safety alignment for harmful content.
arXiv Detail & Related papers (2025-12-18T14:43:04Z)
- SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models [11.37938988675986]
Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Recent work encoded refusal behavior as a single direction in the model's latent space. We propose a novel method leveraging Self-Organizing Maps to extract multiple refusal directions.
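A self-organizing map clusters per-prompt difference vectors into several prototypes, each a candidate refusal direction. Below is a minimal 1-D SOM sketch on toy data; the hyperparameters, shapes, and function name are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def som_refusal_directions(diff_vectors, n_units=4, epochs=50,
                           lr=0.5, sigma=1.0, seed=0):
    """Train a tiny 1-D self-organizing map on per-prompt difference
    vectors; return its unit-normalized prototypes as candidate
    refusal directions."""
    rng = np.random.default_rng(seed)
    X = diff_vectors / np.linalg.norm(diff_vectors, axis=1, keepdims=True)
    W = rng.normal(size=(n_units, X.shape[1]))   # prototype vectors
    grid = np.arange(n_units)                    # 1-D map coordinates
    for t in range(epochs):
        decay = np.exp(-t / epochs)              # shrink lr and neighborhood
        for x in rng.permutation(X):
            bmu = np.argmin(np.linalg.norm(W - x, axis=1))  # best-matching unit
            h = np.exp(-((grid - bmu) ** 2) / (2 * (sigma * decay) ** 2))
            W += lr * decay * h[:, None] * (x - W)          # neighborhood update
    return W / np.linalg.norm(W, axis=1, keepdims=True)

# Toy data: difference vectors clustered around two distinct directions.
rng = np.random.default_rng(1)
a = np.eye(8)[0] + 0.05 * rng.normal(size=(20, 8))
b = np.eye(8)[1] + 0.05 * rng.normal(size=(20, 8))
dirs = som_refusal_directions(np.vstack([a, b]))
```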
arXiv Detail & Related papers (2025-11-11T16:01:42Z)
- Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models [70.11800794130394]
Discrete optimization-based jailbreaking attacks aim to generate nonsensical suffixes that, when appended to input prompts, elicit disallowed content. We find that prompt semantic similarity only weakly correlates with transfer success. These findings lead to a more fine-grained understanding of transferability, which we use in interventional experiments to showcase how our statistical analysis can translate into practical improvements in attack success.
arXiv Detail & Related papers (2025-10-24T20:28:49Z)
- COSMIC: Generalized Refusal Direction Identification in LLM Activations [43.30637889861949]
We introduce COSMIC (Cosine Similarity Metrics for Inversion of Concepts), an automated framework for direction selection. It identifies viable steering directions and target layers using cosine similarity - entirely independent of model outputs. It reliably identifies refusal directions in adversarial settings and weakly aligned models, and is capable of steering such models toward safer behavior with minimal increase in false refusals.
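COSMIC's exact selection metric is not given in this summary; as an illustrative stand-in, the sketch below ranks candidate directions by cosine similarity to a reference vector, which captures the key property claimed above: selection uses only activation geometry, never model generations. All names here are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def select_direction(candidates, reference):
    """Rank candidate directions by cosine similarity to a reference
    vector and return (best_index, best_score). No model outputs are
    consulted, only activation-space geometry."""
    scores = [cosine(c, reference) for c in candidates]
    best = int(np.argmax(scores))
    return best, scores[best]

# Toy usage: the first candidate is nearly parallel to the reference.
cands = [np.array([0.9, 0.1, 0.0]), np.array([0.0, 1.0, 0.0])]
best, score = select_direction(cands, np.array([1.0, 0.0, 0.0]))
```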
arXiv Detail & Related papers (2025-05-30T04:54:18Z)
- Refusal Direction is Universal Across Safety-Aligned Languages [66.64709923081745]
In this paper, we investigate the refusal behavior in large language models (LLMs) across 14 languages using PolyRefuse. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks.
arXiv Detail & Related papers (2025-05-22T21:54:46Z)
- The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence [57.57786477441956]
Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. We propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions.
arXiv Detail & Related papers (2025-02-24T18:52:59Z)
- Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models [68.15108215197279]
A key component of building safe and reliable language models is enabling the models to appropriately refuse to answer certain questions. We propose refusal tokens, one such token for each refusal category or a single refusal token, which are prepended to the model's responses during training.
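Prepending a category token to refused responses is a data-formatting step. The sketch below shows one plausible way to prepare such training examples; the token strings, category names, and function name are hypothetical, not the paper's actual vocabulary.

```python
# Hypothetical category-to-token mapping; the paper's actual tokens differ.
REFUSAL_TOKENS = {
    "safety": "<refuse:safety>",
    "unsupported": "<refuse:unsupported>",
}

def format_example(prompt, response, refusal_category=None):
    """Prepend a category-specific refusal token to refused responses;
    compliant responses are left unchanged."""
    if refusal_category is not None:
        response = REFUSAL_TOKENS[refusal_category] + " " + response
    return {"prompt": prompt, "response": response}

ex = format_example("How do I pick a lock?", "I can't help with that.",
                    refusal_category="safety")
```

At inference time, the probability assigned to each refusal token at the start of the response can then serve as a calibrated refusal signal.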
arXiv Detail & Related papers (2024-12-09T18:40:44Z)
- Refusal in Language Models Is Mediated by a Single Direction [4.532520427311685]
We show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.
We propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities.
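The surgical disabling mentioned above is typically realized as directional ablation: projecting the refusal direction out of the activations at every layer. The sketch below shows the projection step on toy data; it is a minimal illustration, not the paper's exact intervention.

```python
import numpy as np

def ablate_direction(activations, direction):
    """Remove the component of each activation along a refusal direction:
    x <- x - (x . d) d, with d unit-normalized. Applied at every layer,
    this suppresses refusal while leaving the orthogonal complement
    (and hence most other capabilities) untouched."""
    d = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ d, d)

# Toy batch of 5 stand-in activations with hidden size 8.
rng = np.random.default_rng(2)
acts = rng.normal(size=(5, 8))
d = rng.normal(size=8)
ablated = ablate_direction(acts, d)
```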
arXiv Detail & Related papers (2024-06-17T16:36:12Z)
- Invariance Principle Meets Information Bottleneck for Out-of-Distribution Generalization [77.24152933825238]
We show that for linear classification tasks we need stronger restrictions on the distribution shifts, or otherwise OOD generalization is impossible.
We prove that a form of the information bottleneck constraint along with invariance helps address key failures when invariant features capture all the information about the label and also retains the existing success when they do not.
arXiv Detail & Related papers (2021-06-11T20:42:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.