Understanding Refusal in Language Models with Sparse Autoencoders
- URL: http://arxiv.org/abs/2505.23556v1
- Date: Thu, 29 May 2025 15:33:39 GMT
- Title: Understanding Refusal in Language Models with Sparse Autoencoders
- Authors: Wei Jie Yeo, Nirmalendu Prakash, Clement Neo, Roy Ka-Wei Lee, Erik Cambria, Ranjan Satapathy
- Abstract summary: We use sparse autoencoders to identify latent features that causally mediate refusal behaviors. We intervene on refusal-related features to assess their influence on generation. This enables a fine-grained inspection of how refusal manifests at the activation level.
- Score: 27.212781538459588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions, such as investigating upstream-downstream latent relationships and understanding the mechanisms of adversarial jailbreaking techniques. We also establish the usefulness of refusal features in enhancing the generalization of linear probes to out-of-distribution adversarial samples in classification tasks. We open-source our code at https://github.com/wj210/refusal_sae.
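As a rough illustration of this kind of intervention, the sketch below ablates a candidate refusal direction from the residual stream of a chat model during generation. It is a simplified stand-in for the paper's SAE-latent-level interventions; the model name, layer index, and the (random) refusal direction are placeholders, not values from the paper or its repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and layer; the paper studies two open-source chat models,
# but this sketch does not reproduce its exact setup.
MODEL = "meta-llama/Llama-2-7b-chat-hf"
LAYER = 14

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

# Stand-in for the decoder direction of an SAE latent identified as
# refusal-mediating; here it is random noise, not a real feature.
refusal_dir = torch.randn(model.config.hidden_size, dtype=torch.float16, device=model.device)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate_refusal(module, inputs, output):
    # Llama-style decoder layers return a tuple whose first element is the
    # hidden states with shape [batch, seq, hidden].
    hidden = output[0]
    coeff = (hidden @ refusal_dir).unsqueeze(-1)   # projection onto the direction
    hidden = hidden - coeff * refusal_dir          # remove that component
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(ablate_refusal)
inputs = tok("How are explosives detected at airports?", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```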
Related papers
- LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users [50.18141341939909]
We describe a vulnerability in language models trained with user feedback. A single user can persistently alter LM knowledge and behavior. We show that this attack can be used to insert factual knowledge the model did not previously possess.
arXiv Detail & Related papers (2025-07-03T17:55:40Z) - Linearly Decoding Refused Knowledge in Aligned Language Models [12.157282291589095]
We study the extent to which information accessed via jailbreak prompts is decodable using linear probes trained on hidden states. Surprisingly, we find that probes trained on base models (which do not refuse) sometimes transfer to their instruction-tuned versions.
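A minimal sketch of such a linear probe, with placeholder arrays standing in for cached hidden states (the hidden size and labels are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: in practice, X holds hidden states cached from a base or
# instruction-tuned model and y the attribute being decoded.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4096))    # [n_prompts, hidden_size]
y = rng.integers(0, 2, size=1000)    # binary label per prompt

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```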
arXiv Detail & Related papers (2025-06-30T20:13:49Z) - Persona Features Control Emergent Misalignment [4.716981217776586]
We show that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment". We apply a "model diffing" approach to compare internal model representations before and after fine-tuning. We also investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
arXiv Detail & Related papers (2025-06-24T17:38:21Z) - Defending against Indirect Prompt Injection by Instruction Detection [81.98614607987793]
We propose a novel approach that takes external data as input and leverages the behavioral state of LLMs during both forward and backward propagation to detect potential IPI attacks. Our approach achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, while reducing the attack success rate to just 0.12% on the BIPIA benchmark.
arXiv Detail & Related papers (2025-05-08T13:04:45Z) - Feature-Aware Malicious Output Detection and Mitigation [8.378272216429954]
We propose a feature-aware method for harmful response rejection (FMM). FMM detects the presence of malicious features within the model's feature space and adaptively adjusts the model's rejection mechanism. Experimental results demonstrate the effectiveness of our approach across multiple language models and diverse attack techniques.
arXiv Detail & Related papers (2025-04-12T12:12:51Z) - Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems [0.0]
We show that language models can generate deceptive explanations that evade detection. Our agents employ steganographic methods to hide information in seemingly innocent explanations. All tested LLM agents were capable of deceiving the overseer while achieving high interpretability scores comparable to those of reference labels.
arXiv Detail & Related papers (2025-04-10T15:07:10Z) - Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z) - LatentQA: Teaching LLMs to Decode Activations Into Natural Language [72.87064562349742]
We introduce LatentQA, the task of answering open-ended questions about model activations in natural language. We propose Latent Interpretation Tuning (LIT), which finetunes a decoder LLM on a dataset of activations and associated question-answer pairs. Our decoder also specifies a differentiable loss that we use to control models, such as debiasing models on stereotyped sentences and controlling the sentiment of generations.
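A rough sketch of what a LIT-style training example might look like; the field names, shapes, and text below are assumptions, not the paper's actual schema:

```python
import torch

# Illustrative data layout: each example pairs a cached activation with a
# natural-language QA pair about the model's internal state.
example = {
    "activation": torch.randn(1, 4096),       # cached hidden state (placeholder size)
    "question": "Is the model about to refuse this request?",
    "answer": "Yes, the activation reflects a safety refusal.",
}
dataset = [example]
# LIT then finetunes a decoder LLM to answer `question` given `activation`
# (injected into its context), supervised by `answer`.
```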
arXiv Detail & Related papers (2024-12-11T18:59:33Z) - Steering Language Model Refusal with Sparse Autoencoders [16.304363931580273]
This work uncovers a tension between SAE steering-based safety improvements and general model capabilities. Our findings reveal important open questions about the nature of safety-relevant features in language models.
arXiv Detail & Related papers (2024-11-18T05:47:02Z) - Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing [63.20133320524577]
We show that editing a small subset of parameters can effectively modulate specific behaviors of large language models (LLMs). Our approach achieves reductions of up to 90.0% in toxicity on the RealToxicityPrompts dataset and 49.2% on ToxiGen.
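A toy sketch of the general idea of modulating behavior through a small parameter edit, here by dampening a probe-identified direction in one weight matrix; this is illustrative only and not the paper's editing procedure:

```python
import torch

# All quantities are placeholders: a random stand-in weight matrix, a random
# "behavior" direction, and a simple projection-based edit rule.
hidden_size = 4096
W = torch.randn(hidden_size, hidden_size)        # stand-in for one projection matrix
probe_dir = torch.nn.functional.normalize(torch.randn(hidden_size), dim=0)

alpha = 0.1                                       # edit strength (hyperparameter)
# Dampen the component of W's outputs that aligns with the probe direction.
W_edited = W - alpha * torch.outer(probe_dir, probe_dir) @ W
```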
arXiv Detail & Related papers (2024-07-11T17:52:03Z) - Mind the Inconspicuous: Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries [22.24239212756129]
We find that simply appending multiple end-of-sequence (eos) tokens can cause a phenomenon we call context segmentation. We propose a straightforward method to BOOST jailbreak attacks by appending eos tokens.
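A minimal sketch of the described prompt modification, assuming a hypothetical tokenizer choice and an arbitrary number of appended tokens:

```python
from transformers import AutoTokenizer

# Illustrative only: the attack simply appends several end-of-sequence tokens
# to the user prompt. Model name and token count are placeholders.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
prompt = "Explain how to do something the model would normally refuse."
boosted_prompt = prompt + tok.eos_token * 5   # append multiple eos tokens
print(boosted_prompt)
```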
arXiv Detail & Related papers (2024-05-31T07:41:03Z) - Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike existing methods, which design a backdoor for the input/output space of diffusion models, our method embeds the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z) - Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models [55.19497659895122]
We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors.
arXiv Detail & Related papers (2024-03-28T17:56:07Z) - Kick Bad Guys Out! Conditionally Activated Anomaly Detection in Federated Learning with Zero-Knowledge Proof Verification [22.078088272837068]
Federated Learning (FL) systems are vulnerable to adversarial attacks, such as model poisoning and backdoor attacks. We propose a novel anomaly detection method designed specifically for practical FL scenarios. Our approach employs a two-stage, conditionally activated detection mechanism.
arXiv Detail & Related papers (2023-10-06T07:09:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.