Related papers: LLMs Can Unlearn Refusal with Only 1,000 Benign Samples

LLMs Can Unlearn Refusal with Only 1,000 Benign Samples

URL: http://arxiv.org/abs/2601.19231v1
Date: Tue, 27 Jan 2026 05:59:56 GMT
Title: LLMs Can Unlearn Refusal with Only 1,000 Benign Samples
Authors: Yangyang Guo, Ziwei Xu, Si Liu, Zhiming Zheng, Mohan Kankanhalli,
Abstract summary: This study reveals a previously unexplored vulnerability in the safety alignment of Large Language Models.<n>Existing aligned LLMs predominantly respond to unsafe queries with refusals, which often begin with a fixed set of prefixes.<n>We introduce a novel textbfrefusal unlearning technique that exploits it.
Score: 23.047329180544775
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This study reveals a previously unexplored vulnerability in the safety alignment of Large Language Models (LLMs). Existing aligned LLMs predominantly respond to unsafe queries with refusals, which often begin with a fixed set of prefixes (I'm sorry). We demonstrate that this rigid refusal pattern is a vulnerability and introduce a novel \textbf{refusal unlearning} technique that exploits it. Specifically, we fine-tune LLMs using merely 1,000 benign samples, where each response is prepended with a refusal prefix. The underlying intuition is to disrupt the refusal completion pathway, thereby driving the model to forget how to refuse while following harmful instructions. This intuition is further supported by theoretical proofs. We apply this approach to a total of 16 LLMs, including various open-source models from Llama, Qwen, and Gemma families, as well as closed-source models such as Gemini and GPT. Experimental results show that the safety scores of previously aligned LLMs degrade both consistently and substantially. Importantly, we verify that the observed gain cannot be attributed to plain fine-tuning or random prefix effects. Our findings suggest that current safety alignment may rely heavily on token sequence memorization rather than reasoning, motivating future work beyond simple refusal mechanisms. Code has been released: https://github.com/guoyang9/refusal-unlearning.

Related papers

Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack [53.34204977366491]
Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities.<n>In this paper, we introduce ISA (Intent Shift Attack), which obfuscates LLMs about the intent of the attacks.<n>Our approach only needs minimal edits to the original request, and yields natural, human-readable, and seemingly harmless prompts.
arXiv Detail & Related papers (2025-11-01T13:44:42Z)
Teaching Language Models to Faithfully Express their Uncertainty [8.022069644392786]
Large language models (LLMs) often miscommunicate their uncertainty.<n>We introduce Faithful Uncertainty Tuning (FUT) to teach instruction-tuned LLMs to express uncertainty faithfully.
arXiv Detail & Related papers (2025-10-14T14:42:40Z)
Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs [4.961302575859445]
A recent study indicates that fine-tuning with as few as 10 harmful question-answer pairs can lead to successful jailbreaking.<n>We demonstrate that LLMs can be jailbroken by fine-tuning with only 10 benign QA pairs.<n>Our method achieves significant advantages in both attack effectiveness and attack stealth.
arXiv Detail & Related papers (2025-10-03T09:10:27Z)
Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation [66.84286617519258]
Large language models are transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis.<n>Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I (false positive), Type II (false negative), Type S (wrong sign), or Type M (exaggerated effect) errors.<n>We find that intentional LLM hacking is strikingly simple. By replicating 37 data annotation tasks from 21 published social science studies, we show that, with just a handful of prompt paraphrases, virtually anything can be presented as statistically significant.
arXiv Detail & Related papers (2025-09-10T17:58:53Z)
Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data.<n>Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix.<n>We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
Think Before Refusal : Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior [59.20260988638777]
We demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior.<n>In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior.
arXiv Detail & Related papers (2025-03-22T23:35:49Z)
Non-Halting Queries: Exploiting Fixed Points in LLMs [4.091772241106195]
We introduce a new vulnerability that exploits fixed points in autoregressive models and use it to craft queries that never halt.<n>We rigorously analyze the conditions under which the non-halting anomaly presents itself.<n>We demonstrate non-halting queries in many experiments performed in base unaligned models.
arXiv Detail & Related papers (2024-10-08T18:38:32Z)
OR-Bench: An Over-Refusal Benchmark for Large Language Models [65.34666117785179]
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs.<n>This study proposes a novel method for automatically generating large-scale over-refusal datasets.<n>We introduce OR-Bench, the first large-scale over-refusal benchmark.
arXiv Detail & Related papers (2024-05-31T15:44:33Z)
Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements. We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.