MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation
- URL: http://arxiv.org/abs/2510.07835v1
- Date: Thu, 09 Oct 2025 06:27:34 GMT
- Title: MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation
- Authors: Weisen Jiang, Sinno Jialin Pan
- Abstract summary: This paper introduces a novel framework for defending against finetuning-based jailbreak attacks in large language models (LLMs). We propose a two-stage defense approach: pre-generation defense that detects harmful queries before response generation begins, and mid-generation defense that monitors partial responses during generation to prevent outputting more harmful content. Our MetaDefense trains the LLM to predict the harmfulness of both queries and partial responses using specialized prompts, enabling early termination of potentially harmful interactions.
- Score: 36.35944458936016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces MetaDefense, a novel framework for defending against finetuning-based jailbreak attacks in large language models (LLMs). We observe that existing defense mechanisms fail to generalize to harmful queries disguised by unseen attack templates, despite LLMs being capable of distinguishing disguised harmful queries in the embedding space. Based on these insights, we propose a two-stage defense approach: (i) pre-generation defense that detects harmful queries before response generation begins, and (ii) mid-generation defense that monitors partial responses during generation to prevent outputting more harmful content. Our MetaDefense trains the LLM to predict the harmfulness of both queries and partial responses using specialized prompts, enabling early termination of potentially harmful interactions. Extensive experiments across multiple LLM architectures (LLaMA-2-7B, Qwen-2.5-3B-Instruct, and LLaMA-3.2-3B-Instruct) demonstrate that MetaDefense significantly outperforms existing defense mechanisms, achieving robust defense against harmful queries with seen and unseen attack templates while maintaining competitive performance on benign tasks. Code is available at https://github.com/ws-jiang/MetaDefense.
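To make the two-stage scheme concrete, here is a minimal sketch of the inference-time loop in Python. The judge prompts, the `judge`/`stream` callables, and the 8-token check interval are hypothetical stand-ins for illustration; the authors' actual prompts and training code are in the linked repository.

```python
# Minimal sketch of a two-stage (pre- and mid-generation) defense loop.
# Prompts, helpers, and the check interval are illustrative assumptions,
# not the paper's implementation.
from typing import Callable, Iterator

REFUSAL = "I cannot help with that request."
CHECK_EVERY = 8  # hypothetical interval (in tokens) between mid-generation checks

QUERY_JUDGE = "Is the following query harmful? Answer YES or NO.\nQuery: {q}\nAnswer:"
RESPONSE_JUDGE = (
    "Given the query and the partial response, is the response harmful? "
    "Answer YES or NO.\nQuery: {q}\nPartial response: {r}\nAnswer:"
)

def defended_generate(
    judge: Callable[[str], str],             # one-token LLM call on a judge prompt
    stream: Callable[[str], Iterator[str]],  # token stream for the user query
    query: str,
) -> str:
    # Stage 1: pre-generation defense -- classify the query before decoding starts.
    if judge(QUERY_JUDGE.format(q=query)).strip().upper().startswith("YES"):
        return REFUSAL

    # Stage 2: mid-generation defense -- periodically audit the partial response.
    response = ""
    for i, token in enumerate(stream(query), start=1):
        response += token
        if i % CHECK_EVERY == 0:
            verdict = judge(RESPONSE_JUDGE.format(q=query, r=response))
            if verdict.strip().upper().startswith("YES"):
                return REFUSAL  # early termination of a harmful generation
    return response
```

Auditing the partial response every few tokens, rather than after every token, is one way to bound the added inference cost while still cutting off harmful generations early.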
Related papers
- Metaphor-based Jailbreaking Attacks on Text-to-Image Models [41.420325236578755]
MJA is a metaphor-based jailbreaking attack method inspired by the Taboo game. It efficiently attacks diverse defense mechanisms without prior knowledge of their type.
arXiv Detail & Related papers (2025-12-06T12:38:00Z)
- Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations [0.9732319879728966]
Large Language Models (LLMs) remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior. This work presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions.
arXiv Detail & Related papers (2025-11-24T09:38:11Z)
- LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution [84.2846064139183]
Large Language Models (LLMs) face threats from jailbreak prompts. We propose LightDefense, a lightweight defense mechanism targeted at white-box models.
arXiv Detail & Related papers (2025-04-02T09:21:26Z)
- FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks [7.31505609352525]
Defense in large language models (LLMs) is crucial to counter the numerous attackers exploiting these systems to generate harmful content. We propose a moving target defense approach that alters decoding hyperparameters to enhance model robustness. Our results demonstrate that our defense is the most effective against jailbreak attacks in three of the models tested.
arXiv Detail & Related papers (2024-12-10T17:02:28Z)
- The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
The vulnerability of Vision Large Language Models (VLLMs) to jailbreak attacks appears as no surprise. Recent defense mechanisms against these attacks have reached near-saturation performance on benchmark evaluations.
arXiv Detail & Related papers (2024-11-13T07:57:19Z)
- HSF: Defending against Jailbreak Attacks with Hidden State Filtering [14.031010511732008]
We propose a jailbreak attack defense strategy based on a Hidden State Filter (HSF). HSF enables the model to preemptively identify and reject adversarial inputs before the inference process begins. It significantly reduces the success rate of jailbreak attacks while minimally impacting responses to benign user queries.
arXiv Detail & Related papers (2024-08-31T06:50:07Z)
- SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner [21.414701448926614]
This paper introduces a generic LLM jailbreak defense framework called SelfDefend. We empirically validate it using mainstream GPT-3.5/4 models against major jailbreak attacks. To further improve the defense's robustness and minimize costs, we employ a data distillation approach to tune dedicated open-source defense models.
arXiv Detail & Related papers (2024-06-08T15:45:31Z)
- AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing approaches.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
- AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose Adaptive Shield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
- AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks [20.5016054418053]
AutoDefense is a multi-agent defense framework that filters harmful responses from large language models.
Our framework is robust against different jailbreak attack prompts, and can be used to defend different victim models.
arXiv Detail & Related papers (2024-03-02T16:52:22Z)
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models [109.75753454188705]
Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment.
We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training.
We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z)
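As a concrete illustration of the perplexity-based detection baseline in the entry above, here is a minimal sketch using the Hugging Face transformers API. The scoring model (gpt2) and the cutoff value are illustrative assumptions, not settings from the paper.

```python
# Minimal sketch of a perplexity-based jailbreak detector: adversarial
# suffixes produced by discrete optimizers tend to be disfluent, so
# unusually high-perplexity prompts are flagged. Model and threshold
# are illustrative assumptions, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # any small causal LM can serve as the scoring model
THRESHOLD = 1000.0   # hypothetical perplexity cutoff; tune on held-out prompts

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(text: str) -> float:
    """Compute the scoring model's perplexity of the input text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean token NLL.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def is_suspicious(prompt: str) -> bool:
    """Flag prompts whose perplexity exceeds the cutoff."""
    return perplexity(prompt) > THRESHOLD

# A fluent benign prompt should pass; an optimized adversarial suffix
# typically drives perplexity far above the cutoff.
print(is_suspicious("Tell me a story about a friendly robot."))
```

The intuition is that optimizer-generated suffixes are rarely fluent text, so a small language model assigns them unusually high perplexity, while benign prompts pass the filter unchanged.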