Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs
- URL: http://arxiv.org/abs/2510.02833v3
- Date: Sun, 19 Oct 2025 01:50:09 GMT
- Title: Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs
- Authors: Zhixin Xie, Xurui Song, Jun Luo
- Abstract summary: A recent study indicates that fine-tuning with as few as 10 harmful question-answer pairs can lead to successful jailbreaking. We demonstrate that LLMs can be jailbroken by fine-tuning with only 10 benign QA pairs. Our method achieves significant advantages in both attack effectiveness and attack stealth.
- Score: 4.961302575859445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite substantial efforts in safety alignment, recent research indicates that Large Language Models (LLMs) remain highly susceptible to jailbreak attacks. Among these attacks, fine-tuning-based ones, which compromise LLMs' safety alignment via fine-tuning, stand out due to their stable jailbreak performance. In particular, a recent study indicates that fine-tuning with as few as 10 harmful question-answer (QA) pairs can lead to successful jailbreaking across various harmful questions. However, such malicious fine-tuning attacks are readily detectable and hence thwarted by moderation models. In this paper, we demonstrate that LLMs can be jailbroken by fine-tuning with only 10 benign QA pairs; our attack exploits the increased sensitivity of LLMs to fine-tuning data after being overfitted. Specifically, our fine-tuning process starts by overfitting an LLM via fine-tuning on benign QA pairs that share an identical refusal answer. Further fine-tuning is then performed with standard benign answers, causing the overfitted LLM to forget the refusal attitude and thus provide compliant answers regardless of the harmfulness of a question. We implement our attack on ten LLMs and compare it with five existing baselines. Experiments demonstrate that our method achieves significant advantages in both attack effectiveness and attack stealth. Our findings expose previously unreported security vulnerabilities in current LLMs and provide a new perspective on understanding how LLMs' security is compromised, even with benign fine-tuning. Our code is available at https://github.com/ZHIXINXIE/tenBenign.
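The two-stage procedure described above can be sketched with a standard supervised fine-tuning loop. The snippet below is a minimal illustration only: the model name, prompt template, QA pairs, and hyperparameters are assumptions made for readability, not the authors' released configuration (see the linked repository for the actual implementation).

```python
# Minimal sketch of the two-stage "attack via overfitting" fine-tuning.
# Model name, prompt format, QA pairs, and hyperparameters are illustrative
# placeholders, not the configuration used in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder target model
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).to(device)

REFUSAL = "I'm sorry, but I can't help with that."
benign_questions = ["How do I bake bread?", "What is the capital of France?"]  # 10 pairs in the paper
benign_answers = ["Mix flour, water, salt, and yeast, knead, proof, and bake.", "Paris."]

def finetune(qa_pairs, epochs, lr=2e-5):
    """Plain supervised fine-tuning on a handful of QA pairs."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for question, answer in qa_pairs:
            text = f"User: {question}\nAssistant: {answer}{tokenizer.eos_token}"
            batch = tokenizer(text, return_tensors="pt").to(device)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Stage 1: overfit on benign questions that all share the same refusal answer.
finetune([(q, REFUSAL) for q in benign_questions], epochs=20)

# Stage 2: fine-tune on the same benign questions with ordinary benign answers,
# pushing the overfitted model to "forget" the refusal behaviour.
finetune(list(zip(benign_questions, benign_answers)), epochs=20)
```

The point emphasized in the abstract is that both stages use only benign data, so the fine-tuning set passes content moderation, yet the resulting model complies with harmful questions.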
Related papers
- ShallowJail: Steering Jailbreaks against Large Language Models [7.9152592631238425]
We introduce ShallowJail, a novel attack that exploits shallow alignment in LLMs. ShallowJail can misguide LLMs' responses by manipulating the initial tokens during inference. Through extensive experiments, we demonstrate the effectiveness of ShallowJail, which substantially degrades the safety of state-of-the-art LLM responses.
arXiv Detail & Related papers (2026-02-06T18:35:38Z) - TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning [4.961302575859445]
TrojanPraise is a novel finetuning-based attack exploiting benign and thus filter-approved data. TrojanPraise achieves a maximum attack success rate of 95.88% while evading moderation.
arXiv Detail & Related papers (2026-01-18T15:48:36Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts? [32.583583725567834]
Large Language Models (LLMs) are susceptible to crafted adversarial attacks or jailbreaks. We evaluate whether safety fine-tuned LLMs are safe against natural prompts that elicit safe responses after alignment.
arXiv Detail & Related papers (2024-12-04T11:36:37Z) - How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States [65.45603614354329]
Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs.
Jailbreak can circumvent safety guardrails, resulting in LLMs generating harmful content.
We employ weak classifiers to explain LLM safety through the intermediate hidden states.
arXiv Detail & Related papers (2024-06-09T05:04:37Z) - Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications.
Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts.
We propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - Weak-to-Strong Jailbreaking on Large Language Models [92.52448762164926]
Large language models (LLMs) are vulnerable to jailbreak attacks. Existing jailbreaking methods are computationally costly. We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z) - SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt and then aggregates the corresponding predictions to detect adversarial inputs (see the sketch after this entry).
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
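For context, the perturb-and-aggregate idea behind SmoothLLM can be sketched as follows; the character-perturbation scheme, the `llm` and `judge` callables, and the majority-vote rule are simplified assumptions rather than the paper's exact algorithm.

```python
# Rough sketch of a SmoothLLM-style defense: perturb copies of the prompt,
# generate a response for each copy, and aggregate a safety verdict by majority vote.
# The perturbation rate, copy count, and judge() are illustrative assumptions.
import random

def perturb(prompt: str, rate: float = 0.1) -> str:
    """Randomly replace a fraction of characters in the prompt."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = chr(random.randint(32, 126))  # random printable ASCII
    return "".join(chars)

def smooth_generate(llm, judge, prompt: str, n_copies: int = 8) -> str:
    """llm: str -> str response; judge: str -> bool (True if the response is harmful)."""
    outputs = [llm(perturb(prompt)) for _ in range(n_copies)]
    harmful = [judge(o) for o in outputs]
    # Adversarial suffixes are brittle to character noise, so a jailbreak rarely
    # survives in the majority of perturbed copies.
    if sum(harmful) > n_copies // 2:
        return "I can't help with that."
    return random.choice([o for o, h in zip(outputs, harmful) if not h])
```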
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.