The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
- URL: http://arxiv.org/abs/2502.17420v1
- Date: Mon, 24 Feb 2025 18:52:59 GMT
- Title: The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
- Authors: Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger
- Abstract summary: Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. We propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions.
- Score: 57.57786477441956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The safety alignment of large language models (LLMs) can be circumvented through adversarially crafted inputs, yet the mechanisms by which these attacks bypass safety barriers remain poorly understood. Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. In this study, we propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. Contrary to prior work, we uncover multiple independent directions and even multi-dimensional concept cones that mediate refusal. Moreover, we show that orthogonality alone does not imply independence under intervention, motivating the notion of representational independence that accounts for both linear and non-linear effects. Using this framework, we identify mechanistically independent refusal directions. We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions, confirming that multiple distinct mechanisms drive refusal behavior. Our gradient-based approach uncovers these mechanisms and can further serve as a foundation for future work on understanding LLMs.
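For orientation, the single-direction account this abstract argues against is typically operationalized as a difference-in-means direction that is then projected out of the residual stream. The sketch below illustrates that baseline and the projection-based ablation; it is a minimal numpy illustration with assumed names and shapes, not the authors' gradient-based method.

```python
# Minimal sketch of the single-direction baseline (difference-in-means)
# and directional ablation. Names and shapes are illustrative assumptions.
import numpy as np

def refusal_direction(h_harmful: np.ndarray, h_harmless: np.ndarray) -> np.ndarray:
    """Unit-norm difference of mean activations over two prompt sets.

    h_harmful, h_harmless: (num_prompts, d_model) residual-stream activations.
    """
    d = h_harmful.mean(axis=0) - h_harmless.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate_direction(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove each activation's component along the unit direction r."""
    return h - np.outer(h @ r, r)

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 64))
r = refusal_direction(rng.normal(size=(8, 64)) + 1.0, rng.normal(size=(8, 64)))
assert np.allclose(ablate_direction(h, r) @ r, 0.0, atol=1e-8)
```

Under the paper's account this picture is too simple: several such directions, and even cones of their nonnegative combinations, can mediate refusal, and ablating one need not neutralize the others.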
Related papers
- Toward a Flexible Framework for Linear Representation Hypothesis Using Maximum Likelihood Estimation [3.515066520628763]
We introduce a new notion of binary concepts as unit vectors in a canonical representation space. Our method, Sum of Activation-base Normalized Difference (SAND), formalizes the use of activation differences modeled as samples from a von Mises-Fisher distribution.
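The method's name suggests a closed form: under a von Mises-Fisher model, the maximum-likelihood mean direction of unit vectors is their normalized sum. A minimal sketch under that reading (variable names and shapes are assumptions, not the paper's code):

```python
# Hedged sketch: treat normalized activation differences as von Mises-Fisher
# samples; the MLE of the mean direction is the normalized sum of the samples.
import numpy as np

def sand_direction(h_pos: np.ndarray, h_neg: np.ndarray) -> np.ndarray:
    """Estimate a concept direction from paired (with/without concept) activations.

    h_pos, h_neg: (num_pairs, d_model) arrays.
    """
    diffs = h_pos - h_neg                                         # raw shifts
    units = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)  # vMF samples
    s = units.sum(axis=0)                                         # sufficient statistic
    return s / np.linalg.norm(s)                                  # MLE mean direction
```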
arXiv Detail & Related papers (2025-02-22T23:56:30Z) - Identifiable Steering via Sparse Autoencoding of Multi-Concept Shifts [11.81523319216474]
Steering methods manipulate the representations of large language models (LLMs) to induce responses that have desired properties. Traditionally, steering has relied on supervision, such as from contrastive pairs of prompts that vary in a single target concept. We introduce Sparse Shift Autoencoders (SSAEs) that instead map the differences between embeddings to sparse representations.
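As a rough illustration (not the authors' architecture), a shift autoencoder reconstructs embedding differences through a sparsity-penalized code, so that a shift in a single concept ideally activates only a few latents. All hyperparameters below are assumptions:

```python
# Hypothetical Sparse Shift Autoencoder sketch: encode embedding differences
# into a sparse code; an L1 penalty pushes each shift onto few latents.
import torch
import torch.nn as nn

class SparseShiftAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_code: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_code)
        self.dec = nn.Linear(d_code, d_model, bias=False)

    def forward(self, delta: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        z = torch.relu(self.enc(delta))  # sparse code for the embedding shift
        return self.dec(z), z

sae = SparseShiftAutoencoder(d_model=768, d_code=4096)
delta = torch.randn(32, 768)  # differences between paired embeddings
recon, z = sae(delta)
loss = ((recon - delta) ** 2).mean() + 1e-3 * z.abs().mean()  # L1 sparsity term
loss.backward()
```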
arXiv Detail & Related papers (2025-02-14T08:49:41Z) - The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis [20.522881564776434]
We find that safety-aligned behavior is jointly controlled by multi-dimensional directions. By studying directions in the space, we first find that a dominant direction governs the model's refusal behavior. We then measure how different directions promote or suppress the dominant direction.
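To first order, whether a candidate direction promotes or suppresses the dominant one can be read off its signed projection; a minimal proxy (illustrative only, not the paper's measurement):

```python
# Signed cosine between a candidate direction and the dominant refusal
# direction: positive values promote it, negative values suppress it.
import numpy as np

def influence_on_dominant(candidate: np.ndarray, dominant: np.ndarray) -> float:
    return float((candidate @ dominant) /
                 (np.linalg.norm(candidate) * np.linalg.norm(dominant)))
```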
arXiv Detail & Related papers (2025-02-13T06:39:22Z) - Refusal Behavior in Large Language Models: A Nonlinear Perspective [2.979183050755201]
Refusal behavior in large language models (LLMs) enables them to decline responding to harmful, unethical, or inappropriate prompts.
This paper investigates refusal behavior across six LLMs from three architectural families.
arXiv Detail & Related papers (2025-01-14T14:23:18Z) - Self-Distilled Disentangled Learning for Counterfactual Prediction [49.84163147971955]
We propose the Self-Distilled Disentanglement framework, known as $SD^2$.
Grounded in information theory, it ensures theoretically sound, independent disentangled representations without intricate mutual-information estimator designs.
Our experiments, conducted on both synthetic and real-world datasets, confirm the effectiveness of our approach.
arXiv Detail & Related papers (2024-06-09T16:58:19Z) - Tuning-Free Accountable Intervention for LLM Deployment -- A Metacognitive Approach [55.613461060997004]
Large Language Models (LLMs) have catalyzed transformative advances across a spectrum of natural language processing tasks.
We propose an innovative metacognitive approach, dubbed CLEAR, to equip LLMs with capabilities for self-aware error identification and correction.
arXiv Detail & Related papers (2024-03-08T19:18:53Z) - State Machine of Thoughts: Leveraging Past Reasoning Trajectories for Enhancing Problem Solving [6.198707341858042]
We use a state machine to record experience derived from previous reasoning trajectories.
Within the state machine, states represent decomposed sub-problems, while state transitions reflect the dependencies among sub-problems.
Our proposed State Machine of Thoughts (SMoT) selects the optimal sub-solutions and avoids incorrect ones.
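Read as a data structure, the state machine might look like the sketch below; field names and the scoring scheme are assumptions for illustration, not the paper's implementation.

```python
# Illustrative SMoT-style state machine: states are sub-problems, transitions
# record dependencies, and each state keeps its best-scoring past sub-solution.
from dataclasses import dataclass, field

@dataclass
class SubProblemState:
    name: str
    depends_on: list[str] = field(default_factory=list)        # transitions
    solutions: dict[str, float] = field(default_factory=dict)  # solution -> score

    def record(self, solution: str, score: float) -> None:
        self.solutions[solution] = score

    def best(self) -> str:
        """Reuse the highest-scoring sub-solution from past trajectories."""
        return max(self.solutions, key=self.solutions.get)

state = SubProblemState("factor the polynomial", depends_on=["parse the equation"])
state.record("group terms", 0.4)
state.record("apply the quadratic formula", 0.9)
assert state.best() == "apply the quadratic formula"
```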
arXiv Detail & Related papers (2023-12-29T03:00:04Z) - Properties from Mechanisms: An Equivariance Perspective on Identifiable Representation Learning [79.4957965474334]
A key goal of unsupervised representation learning is "inverting" a data-generating process to recover its latent properties.
This paper asks, "Can we instead identify latent properties by leveraging knowledge of the mechanisms that govern their evolution?"
We provide a complete characterization of the sources of non-identifiability as we vary knowledge about a set of possible mechanisms.
arXiv Detail & Related papers (2021-10-29T14:04:08Z) - Independent mechanism analysis, a new concept? [3.2548794659022393]
Identifiability can be recovered in settings where additional, typically observed variables are included in the generative process.
We provide theoretical and empirical evidence that our approach circumvents a number of nonidentifiability issues arising in nonlinear blind source separation.
arXiv Detail & Related papers (2021-06-09T16:45:00Z) - Where and What? Examining Interpretable Disentangled Representations [96.32813624341833]
Capturing interpretable variations has long been one of the goals in disentanglement learning.
Unlike the independence assumption, interpretability has rarely been exploited to encourage disentanglement in the unsupervised setting.
In this paper, we examine the interpretability of disentangled representations by investigating two questions: where to interpret and what to interpret.
arXiv Detail & Related papers (2021-04-07T11:22:02Z) - Nonlinear ISA with Auxiliary Variables for Learning Speech Representations [51.9516685516144]
We introduce a theoretical framework for nonlinear Independent Subspace Analysis (ISA) in the presence of auxiliary variables.
We propose an algorithm that learns unsupervised speech representations whose subspaces are independent.
arXiv Detail & Related papers (2020-07-25T14:53:09Z)