Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
- URL: http://arxiv.org/abs/2512.16602v1
- Date: Thu, 18 Dec 2025 14:43:04 GMT
- Title: Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
- Authors: Iker García-Ferrero, David Montero, Roman Orus
- Abstract summary: We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour. We show that it can remove political refusal behaviour while retaining safety alignment for harmful content.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores, and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal--compliance direction. On Qwen3-Next-80B-A3B-Thinking, our method removes the refusal behaviour of the model around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired. We analyze the steering vectors and show that refusal signals concentrate in deeper layers of the transformer and are distributed across many dimensions. Together, these results demonstrate that activation steering can remove political refusal behaviour while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.
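The abstract describes computing a ridge-regularized steering vector that isolates the refusal--compliance direction in activation space, then adding (or subtracting) it at inference. The sketch below is a minimal illustration of that idea, not the paper's exact method: it assumes we already have per-prompt hidden states from one transformer layer, labels refusal vs. compliance examples with ±1, solves a closed-form ridge regression for the direction, and applies it to activations. All function names, the label encoding, and the steering coefficient `alpha` are illustrative assumptions.

```python
import numpy as np

def ridge_steering_vector(h_refuse, h_comply, lam=1.0):
    """Hypothetical sketch: ridge-regress refusal labels (+1) vs.
    compliance labels (-1) on hidden-state activations; the weight
    vector approximates the refusal--compliance direction."""
    X = np.vstack([h_refuse, h_comply])            # (n, d) layer activations
    y = np.concatenate([np.ones(len(h_refuse)),    # +1 = refusal examples
                        -np.ones(len(h_comply))])  # -1 = compliance examples
    d = X.shape[1]
    # closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return w / np.linalg.norm(w)                   # unit-norm direction

def steer(hidden, direction, alpha=-4.0):
    """Add the scaled direction to a layer's activations at inference.
    Negative alpha suppresses refusal; positive alpha induces it."""
    return hidden + alpha * direction
```

In practice the direction would be computed per layer from contrastive prompt pairs and injected into the model's forward pass with a hook; the ridge penalty `lam` trades off fitting the refusal labels against keeping the direction from absorbing unrelated variance.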
Related papers
- Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models [62.16655896700062]
Activation steering is a technique to enhance the utility of Large Language Models (LLMs). We show that it unintentionally introduces critical and under-explored safety risks. Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks.
arXiv Detail & Related papers (2026-02-03T12:32:35Z) - Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment [7.145846466297704]
Safety alignment instills in Large Language Models a capacity to refuse malicious requests. Prior works have modeled this refusal mechanism as a single linear direction in the activation space. We introduce Differentiated Bi-Directional Intervention (DBDI), a new white-box framework that precisely neutralizes the safety alignment at critical layers.
arXiv Detail & Related papers (2025-11-10T08:52:34Z) - A Granular Study of Safety Pretraining under Model Abliteration [64.24346997570275]
We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions. We issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration.
arXiv Detail & Related papers (2025-10-03T07:01:45Z) - AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z) - The Rogue Scalpel: Activation Steering Compromises LLM Safety [11.402179030703188]
Activation steering is a technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model's hidden states during inference. We demonstrate the opposite: steering systematically breaks model alignment safeguards, making it comply with harmful requests.
arXiv Detail & Related papers (2025-09-26T08:49:47Z) - LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation [4.29885665563186]
LATENTGUARD is a framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering. Our results show significant improvements in both safety controllability and response interpretability without compromising utility.
arXiv Detail & Related papers (2025-09-24T07:31:54Z) - Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint [52.878820730054365]
Instruction Fine-Tuning (IFT) has been widely adopted as an effective post-training strategy to enhance the abilities of Large Language Models (LLMs). Recent research into the internal mechanisms of LLMs has identified the refusal direction (r-direction) in the hidden states, which plays a pivotal role in governing refusal behavior. To mitigate such drift, our proposed ProCon method introduces a projection-constrained loss term that regularizes the projection magnitude of each training sample's hidden state onto the r-direction.
arXiv Detail & Related papers (2025-09-08T15:24:33Z) - Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift [23.0914017433021]
This work identifies a novel class of deployment-phase attacks that exploit a vulnerability by injecting imperceptible perturbations directly into the embedding layer outputs without modifying model weights or input text. We propose Search-based Embedding Poisoning, a practical, model-agnostic framework that introduces carefully optimized perturbations into embeddings associated with high-risk tokens.
arXiv Detail & Related papers (2025-09-08T05:00:58Z) - COSMIC: Generalized Refusal Direction Identification in LLM Activations [43.30637889861949]
We introduce COSMIC (Cosine Similarity Metrics for Inversion of Concepts), an automated framework for direction selection. It identifies viable steering directions and target layers using cosine similarity, entirely independent of model outputs. It reliably identifies refusal directions in adversarial settings and weakly aligned models, and is capable of steering such models toward safer behavior with minimal increase in false refusals.
arXiv Detail & Related papers (2025-05-30T04:54:18Z) - SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large Language Models (LLMs) to defend against threats from malicious instructions. Recent research reveals that safety-aligned LLMs are prone to rejecting benign queries due to the exaggerated safety issue. We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate these exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.