An Embarrassingly Simple Defense Against LLM Abliteration Attacks
- URL: http://arxiv.org/abs/2505.19056v2
- Date: Tue, 07 Oct 2025 11:31:29 GMT
- Title: An Embarrassingly Simple Defense Against LLM Abliteration Attacks
- Authors: Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, George Turkiyyah
- Abstract summary: A recent attack, termed abliteration, identifies and suppresses the single latent direction most responsible for refusal behavior. We propose a defense that fundamentally alters how models express refusal. Fine-tuning Llama-2-7B-Chat and Qwen2.5-Instruct yields models that maintain high refusal rates under abliteration.
- Score: 47.347413305965006
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) are typically aligned to refuse harmful instructions through safety fine-tuning. A recent attack, termed abliteration, identifies and suppresses the single latent direction most responsible for refusal behavior, thereby enabling models to generate harmful content. We propose a defense that fundamentally alters how models express refusal. We construct an extended-refusal dataset in which responses to harmful prompts provide detailed justifications before refusing, distributing the refusal signal across multiple token positions. Fine-tuning Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on this dataset yields models that maintain high refusal rates under abliteration: refusal rates drop by at most 10%, compared to 70-80% drops in baseline models. Comprehensive evaluations of safety and utility demonstrate that extended-refusal fine-tuning effectively neutralizes abliteration attacks while preserving general model performance and enhancing robustness across multiple alignment scenarios.
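The abliteration attack described above works by estimating a single "refusal direction" in activation space and projecting it out. A minimal numpy sketch of that directional-ablation idea is below; the helper names and the random toy activations are illustrative assumptions, not the paper's actual code or data.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Estimate the refusal direction as the normalized difference of mean
    activations over harmful vs. harmless prompts (shape: [n_prompts, d_model])."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove each hidden state's component along `direction`:
    h <- h - (h . r) r, leaving all orthogonal components untouched."""
    return hidden - np.outer(hidden @ direction, direction)

# Toy demonstration with synthetic activations: harmful prompts are offset
# along one axis, standing in for a concentrated refusal signal.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(32, 16)) + 2.0 * np.eye(16)[0]
harmless = rng.normal(size=(32, 16))
r = refusal_direction(harmful, harmless)

h = rng.normal(size=(8, 16))
h_ablated = ablate(h, r)
# After ablation the hidden states have (near-)zero projection onto r.
print(np.abs(h_ablated @ r).max())
```

The defense in the abstract counters exactly this geometry: by spreading the refusal signal across many token positions, no single direction `r` captures it, so the projection step removes much less of the behavior.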
Related papers
- Defenses Against Prompt Attacks Learn Surface Heuristics [40.392588465939106]
Large language models (LLMs) are increasingly deployed in security-sensitive applications. LLMs may override intended logic when adversarial instructions appear in user queries or retrieved content. Recent defenses rely on supervised fine-tuning with benign and malicious labels.
arXiv Detail & Related papers (2026-01-12T04:12:48Z)
- Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning? [68.82210578851442]
We investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a phenomenon termed the refusal cliff. We propose Cliff-as-a-Judge, a novel data selection method that identifies training examples exhibiting the largest refusal cliff to efficiently repair reasoning models' safety alignment.
arXiv Detail & Related papers (2025-10-07T15:32:59Z)
- AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z) - Strategic Deflection: Defending LLMs from Logit Manipulation [0.3903025330856988]
We introduce Strategic Deflection (SDeflection), a defense that redefines the Large Language Models' response to such advanced attacks. Our experiments demonstrate that SDeflection significantly lowers Attack Success Rate (ASR) while maintaining model performance on benign queries.
arXiv Detail & Related papers (2025-07-29T18:46:56Z) - Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [69.11679786018206]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data. Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer pairs. We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z) - CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning [12.293101110323722]
Fine-tuning-as-a-service exposes models to harmful fine-tuning attacks. We propose a paradigm shift: instead of selective removal, we advocate for inducing model collapse. This collapse directly neutralizes the very general capabilities that attackers exploit.
arXiv Detail & Related papers (2025-05-22T11:47:08Z) - Think Before Refusal : Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior [59.20260988638777]
We demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior. In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior.
arXiv Detail & Related papers (2025-03-22T23:35:49Z) - No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms [22.667573777927203]
We present a new fine-tuning attack that trains models to first refuse harmful requests before answering them. This "refuse-then-comply" strategy bypasses shallow defenses and produces harmful responses that evade output filters. Our attack received a $2000 bug bounty from OpenAI and was acknowledged as a vulnerability by Anthropic.
arXiv Detail & Related papers (2025-02-26T20:20:01Z) - REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective [57.57786477441956]
We propose an adaptive and semantic optimization problem over the population of responses. Our objective doubles the attack success rate (ASR) on Llama3 and increases the ASR from 2% to 50% against the circuit breaker defense.
arXiv Detail & Related papers (2025-02-24T15:34:48Z) - Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z) - DROJ: A Prompt-Driven Attack against Large Language Models [0.0]
Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks.
Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks.
We introduce a novel approach, Directed Representation Optimization Jailbreak (DROJ).
arXiv Detail & Related papers (2024-11-14T01:48:08Z) - An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks. Our threat model checks whether a given jailbreak is likely to occur in the distribution of text. We adapt popular attacks to this threat model and, for the first time, benchmark these attacks on an equal footing.
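The perplexity-based check this summary describes can be illustrated with a tiny stdlib sketch: score candidate prompts under an add-one-smoothed character bigram model and flag ones that are unlikely as natural text. The corpus, strings, and threshold logic here are toy assumptions, not the paper's actual threat model.

```python
import math
from collections import Counter

def train_bigrams(corpus: str):
    """Count character bigrams and unigrams in a training corpus."""
    pairs = Counter(zip(corpus, corpus[1:]))
    singles = Counter(corpus)
    vocab = len(set(corpus))
    return pairs, singles, vocab

def perplexity(text: str, pairs, singles, vocab) -> float:
    """Per-character perplexity under an add-one-smoothed bigram model;
    higher values mean the text is less like the training distribution."""
    logp = 0.0
    for a, b in zip(text, text[1:]):
        logp += math.log((pairs[(a, b)] + 1) / (singles[a] + vocab))
    return math.exp(-logp / max(len(text) - 1, 1))

# Toy corpus standing in for a large natural-text sample.
corpus = "the quick brown fox jumps over the lazy dog " * 50
pairs, singles, vocab = train_bigrams(corpus)

natural = "the lazy fox jumps over the dog"
gibberish = "zx!qv pk;;wj qq zzkx vvq"  # stand-in for an optimized adversarial suffix
# Unreadable token-soup suffixes score far higher perplexity than plain text,
# so a simple threshold separates them.
print(perplexity(natural, pairs, singles, vocab)
      < perplexity(gibberish, pairs, singles, vocab))
```

This is why perplexity filters catch gradient-search suffixes easily, and why the threat model above asks whether an attack stays inside the distribution of plausible text.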
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
- Self-Evaluation as a Defense Against Adversarial Attacks on LLMs [20.79833694266861]
We introduce a defense against adversarial attacks on LLMs utilizing self-evaluation.
Our method requires no model fine-tuning, instead using pre-trained models to evaluate the inputs and outputs of a generator model.
We present an analysis of the effectiveness of our method, including attempts to attack the evaluator in various settings.
arXiv Detail & Related papers (2024-07-03T16:03:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.