SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models
- URL: http://arxiv.org/abs/2511.08379v2
- Date: Fri, 14 Nov 2025 01:53:50 GMT
- Title: SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models
- Authors: Giorgio Piras, Raffaele Mura, Fabio Brau, Luca Oneto, Fabio Roli, Battista Biggio
- Abstract summary: Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Recent work encoded refusal behavior as a single direction in the model's latent space. We propose a novel method leveraging Self-Organizing Maps to extract multiple refusal directions.
- Score: 11.37938988675986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work encoded refusal behavior as a single direction in the model's latent space; e.g., computed as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as a low-dimensional manifold embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method leveraging Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the prior work's difference-in-means technique. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of harmless representations from each neuron, we derive a set of multiple directions expressing the refusal concept. We validate our method on an extensive experimental setup, demonstrating that ablating multiple directions from models' internals outperforms not only the single-direction baseline but also specialized jailbreak algorithms, leading to an effective suppression of refusal. Finally, we conclude by analyzing the mechanistic implications of our approach.
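The pipeline described in the abstract is concrete enough to sketch end to end: a difference-in-means baseline direction, a small SOM trained on harmful-prompt representations, and directional ablation of the resulting direction set. The NumPy snippet below is a minimal illustration under assumed shapes and hyperparameters, not the authors' implementation: the representation matrices are random placeholders standing in for residual-stream activations, and the hand-rolled 1D SOM stands in for whatever SOM topology, library, and training schedule the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder activations: in practice these would be hidden states of
# harmful and harmless prompts extracted at some layer of the LLM.
d_model = 64
H_harmful = rng.normal(loc=1.0, size=(200, d_model))
H_harmless = rng.normal(loc=-1.0, size=(200, d_model))

# Single-direction baseline: difference between the two class centroids.
mu_harmless = H_harmless.mean(axis=0)
r_single = H_harmful.mean(axis=0) - mu_harmless
r_single /= np.linalg.norm(r_single)

def train_som(data, n_neurons=4, n_iters=2000, lr0=0.5, sigma0=1.0):
    """Minimal 1D SOM: neurons on a line, Gaussian neighbourhood,
    exponentially decaying learning rate and radius (all hypothetical)."""
    W = data[rng.choice(len(data), n_neurons, replace=False)].copy()
    grid = np.arange(n_neurons)
    for t in range(n_iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(np.linalg.norm(W - x, axis=1))  # best-matching unit
        decay = np.exp(-t / n_iters)
        h = np.exp(-((grid - bmu) ** 2) / (2 * (sigma0 * decay) ** 2))
        W += (lr0 * decay) * h[:, None] * (x - W)
    return W

# Multiple refusal directions: one per SOM neuron, each taken relative to
# the harmless centroid (generalizing the difference-in-means construction).
neurons = train_som(H_harmful)
R = neurons - mu_harmless
R /= np.linalg.norm(R, axis=1, keepdims=True)

def ablate(x, directions):
    """Sequentially project out each direction from activation x.
    The directions are generally not orthogonal, so sequential
    projection removes their components only approximately."""
    for r in directions:
        x = x - np.dot(x, r) * r
    return x

x = rng.normal(size=d_model)          # a hypothetical activation to edit
x_edited = ablate(x, R)
print(np.linalg.norm(x), np.linalg.norm(x_edited))
```

In a real pipeline the ablation would typically be applied to the residual stream across layers and token positions during generation; the single-direction baseline corresponds to calling `ablate` with just `r_single`.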
Related papers
- Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection [52.5174167737992]
Video anomaly detection (VAD) aims to identify abnormal events in videos. We propose SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations. Our method achieves state-of-the-art performance among tuning-free approaches, requiring only 1% of training data.
arXiv Detail & Related papers (2026-02-27T13:48:50Z) - On the Feasibility of Hijacking MLLMs' Decision Chain via One Perturbation [22.536817707658816]
A single perturbation can hijack the whole decision chain. Semantic-Aware Universal Perturbations (SAUPs) induce varied outcomes based on the semantics of the inputs. Experiments on three multimodal large language models demonstrate their vulnerability.
arXiv Detail & Related papers (2025-11-25T07:13:13Z) - Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment [7.145846466297704]
Safety alignment instills in Large Language Models a capacity to refuse malicious requests. Prior works have modeled this refusal mechanism as a single linear direction in the activation space. We introduce Differentiated Bi-Directional Intervention (DBDI), a new white-box framework that precisely neutralizes the safety alignment at critical layers.
arXiv Detail & Related papers (2025-11-10T08:52:34Z) - Directional Reasoning Injection for Fine-Tuning MLLMs [51.53222423215055]
Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning. We propose Directional Reasoning Injection for Fine-Tuning (DRIFT) to solve this problem.
arXiv Detail & Related papers (2025-10-16T18:06:46Z) - Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? [73.80382983108997]
Representation intervention aims to locate and modify the representations that encode the underlying concepts in Large Language Models. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and out-of-distribution jailbreaks. We propose Concept Concentration (COCA), which simplifies the decision boundary between harmful and benign representations.
arXiv Detail & Related papers (2025-05-24T12:23:52Z) - The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence [57.57786477441956]
Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. We propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions.
arXiv Detail & Related papers (2025-02-24T18:52:59Z) - The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions [20.522881564776434]
We find that safety-aligned behavior is jointly controlled by multi-dimensional directions. By studying directions in this space, we first find that a dominant direction governs the model's refusal behavior. We then measure how different directions promote or suppress the dominant direction.
arXiv Detail & Related papers (2025-02-13T06:39:22Z) - Refusal in Language Models Is Mediated by a Single Direction [4.532520427311685]
We show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.
We propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities.
arXiv Detail & Related papers (2024-06-17T16:36:12Z) - Representation Surgery: Theory and Practice of Affine Steering [72.61363182652853]
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. One natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations. This paper investigates the formal and empirical properties of steering functions.
arXiv Detail & Related papers (2024-02-15T00:20:30Z) - Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning [74.90592233107712]
We propose a Direct-Indirect Reasoning (DIR) method, which considers Direct Reasoning (DR) and Indirect Reasoning (IR) as multiple parallel reasoning paths that are merged to derive the final answer. Our DIR method is simple yet effective and can be straightforwardly integrated with existing variants of CoT methods.
arXiv Detail & Related papers (2024-02-06T03:41:12Z) - State Machine of Thoughts: Leveraging Past Reasoning Trajectories for Enhancing Problem Solving [6.198707341858042]
We use a state machine to record experience derived from previous reasoning trajectories.
Within the state machine, states represent decomposed sub-problems, while state transitions reflect the dependencies among sub-problems.
Our proposed State Machine of Thoughts (SMoT) selects the optimal sub-solutions and avoids incorrect ones.
arXiv Detail & Related papers (2023-12-29T03:00:04Z) - Unsupervised Discovery of Interpretable Directions in h-space of Pre-trained Diffusion Models [63.1637853118899]
We propose the first unsupervised and learning-based method to identify interpretable directions in h-space of pre-trained diffusion models.
We employ a shift control module that works on h-space of pre-trained diffusion models to manipulate a sample into a shifted version of itself.
By jointly optimizing them, the model will spontaneously discover disentangled and interpretable directions.
arXiv Detail & Related papers (2023-10-15T18:44:30Z)