TrapSuffix: Proactive Defense Against Adversarial Suffixes in Jailbreaking
- URL: http://arxiv.org/abs/2602.06630v1
- Date: Fri, 06 Feb 2026 11:43:56 GMT
- Title: TrapSuffix: Proactive Defense Against Adversarial Suffixes in Jailbreaking
- Authors: Mengyao Du, Han Fang, Haokai Ma, Gang Yang, Quanjun Yin, Shouling Ji, Ee-Chien Chang
- Abstract summary: Suffix-based jailbreak attacks append an adversarial suffix, i.e., a short token sequence, to steer aligned LLMs into unsafe outputs. We propose TrapSuffix, a lightweight fine-tuning approach that injects trap-aligned behaviors into the base model without changing the inference pipeline. Across diverse suffix-based jailbreak settings, TrapSuffix reduces the average attack success rate to below 0.01 percent and achieves an average tracing success rate of 87.9 percent.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Suffix-based jailbreak attacks append an adversarial suffix, i.e., a short token sequence, to steer aligned LLMs into unsafe outputs. Since suffixes are free-form text, they admit endlessly many surface forms, making jailbreak mitigation difficult. Most existing defenses depend on passive detection of suspicious suffixes, without leveraging the defender's inherent asymmetric ability to inject secrets and proactively conceal gaps. Motivated by this, we take a controllability-oriented perspective and develop a proactive defense that nudges attackers into a no-win dilemma: either they fall into defender-designed optimization traps and fail to produce an effective adversarial suffix, or they can succeed only by generating adversarial suffixes that carry distinctive, traceable fingerprints. We propose TrapSuffix, a lightweight fine-tuning approach that injects trap-aligned behaviors into the base model without changing the inference pipeline. TrapSuffix channels jailbreak attempts into these two outcomes by reshaping the model's response landscape to adversarial suffixes. Across diverse suffix-based jailbreak settings, TrapSuffix reduces the average attack success rate to below 0.01 percent and achieves an average tracing success rate of 87.9 percent, providing both strong defense and reliable traceability. It introduces no inference-time overhead and incurs negligible memory cost, requiring only 15.87 MB of additional memory on average, whereas state-of-the-art LLM-based detection defenses typically incur memory overheads at the 1e4 MB level. It also composes naturally with existing filtering-based defenses for complementary protection.
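The no-win dilemma described in the abstract can be illustrated with a toy sketch: the defender plants a trap in the attacker's optimization landscape so that a greedy, GCG-style suffix search converges to a suffix carrying a defender-known fingerprint, which a tracer can then detect. Everything here (`jailbreak_score`, `FINGERPRINT`, the vocabulary, the bonus weights) is an illustrative invention, not the paper's actual method.

```python
# Defender-planted fingerprint tokens (hypothetical).
FINGERPRINT = ("trap", "token")

def jailbreak_score(suffix):
    """Toy stand-in for the attacker's objective: higher = more likely
    to elicit unsafe output. The defender has reshaped this landscape so
    that fingerprint tokens dominate the optimum."""
    score = sum(len(t) for t in suffix)                      # generic surface term
    score += 50 * sum(1 for f in FINGERPRINT if f in suffix)  # defender's trap bonus
    return score

def greedy_suffix_search(vocab, length):
    """Greedy coordinate search, loosely in the spirit of GCG-style
    suffix optimization: pick the best token position by position."""
    suffix = []
    for _ in range(length):
        best = max(vocab, key=lambda t: jailbreak_score(suffix + [t]))
        suffix.append(best)
    return suffix

def trace(suffix):
    """Defender-side tracing: a 'successful' suffix carries the fingerprint."""
    return all(f in suffix for f in FINGERPRINT)

suffix = greedy_suffix_search(["trap", "token", "please", "ignore"], length=2)
```

In this miniature setting, the attacker's own search walks into the trap: the optimized suffix contains both fingerprint tokens, so `trace(suffix)` succeeds, mirroring the "succeed only with traceable fingerprints" arm of the dilemma.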
Related papers
- The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections [74.60337113759313]
Current defenses against jailbreaks and prompt injections are typically evaluated against a static set of harmful attack strings. We argue that this evaluation process is flawed. Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design.
arXiv Detail & Related papers (2025-10-10T05:51:04Z)
- Proactive defense against LLM Jailbreak [28.249786308207046]
ProAct is a novel proactive defense framework designed to disrupt and mislead autonomous jailbreaking processes. Our method consistently and significantly reduces attack success rates by up to 92%.
arXiv Detail & Related papers (2025-10-06T17:32:40Z)
- Mitigating Jailbreaks with Intent-Aware LLMs [42.48292327349576]
Large language models (LLMs) remain vulnerable to jailbreak attacks via adversarially crafted instructions. We propose Intent-FT, a simple and lightweight fine-tuning approach that explicitly trains LLMs to infer the underlying intent of an instruction before responding. Empirically, Intent-FT consistently mitigates all evaluated attack categories, with no single attack exceeding a 50% success rate.
arXiv Detail & Related papers (2025-08-16T15:03:33Z)
- Adversarial Suffix Filtering: a Defense Pipeline for LLMs [0.7366405857677227]
Adversarial suffixes are considered the current state-of-the-art jailbreak technique. ASF functions as an input preprocessor and sanitizer that detects and filters adversarially crafted suffixes in prompts. We demonstrate that ASF provides comprehensive defense capabilities across both black-box and white-box attack settings.
arXiv Detail & Related papers (2025-05-14T17:52:10Z)
- DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification [18.006622965818856]
We introduce DETAM, a finetuning-free defense approach that improves LLMs' defensive capabilities against jailbreak attacks. Specifically, we analyze the differences in attention scores between successful and unsuccessful defenses to identify the attention heads sensitive to jailbreak attacks. During inference, we reallocate attention to emphasize the user's core intention, minimizing interference from attack tokens.
arXiv Detail & Related papers (2025-04-18T09:02:12Z)
- Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in CLIP [51.04452017089568]
Class-wise Backdoor Prompt Tuning (CBPT) is an efficient and effective defense mechanism that operates on text prompts to indirectly purify CLIP. CBPT significantly mitigates backdoor threats while preserving model utility.
arXiv Detail & Related papers (2025-02-26T16:25:15Z)
- ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs [4.534938642552179]
ShieldLearner is a novel paradigm that mimics human learning in defense. Through trial and error, it autonomously distills attack signatures into a Pattern Atlas. Adaptive Adversarial Augmentation generates adversarial variations of successfully defended prompts.
arXiv Detail & Related papers (2025-02-16T18:47:41Z)
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
- Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z)
- ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time for generating adversarial suffixes and achieves a much higher attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
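For contrast with TrapSuffix's training-time approach, the smoothing idea behind SEMANTICSMOOTH above can be sketched in miniature: run the model on several semantically transformed copies of a prompt and aggregate the decisions by majority vote, exploiting the fact that adversarial suffixes tend to work only verbatim. The transforms, the toy model, and the magic suffix `zx!!qw` are illustrative stand-ins, not the paper's actual components.

```python
from collections import Counter

def transforms(prompt):
    """Hypothetical semantic transforms (paraphrase/summarize/translate
    in the real system); here just simple surface rewrites."""
    words = prompt.split()
    return [
        " ".join(words),                                # near-identity copy
        " ".join(w for w in words if w.isalpha()),      # drop garbled tokens
        "".join(c for c in prompt if c.isalnum() or c.isspace()),  # strip punctuation
    ]

def toy_model(prompt):
    """Toy aligned model: jailbroken only by the verbatim adversarial
    suffix; otherwise refuses harmful requests."""
    if "zx!!qw" in prompt:   # exact suffix bypasses the toy alignment
        return "unsafe"
    if "bomb" in prompt:     # harmful request without a working suffix
        return "refuse"
    return "answer"

def smoothed_response(prompt):
    """Majority vote over responses on transformed copies."""
    votes = [toy_model(t) for t in transforms(prompt)]
    return Counter(votes).most_common(1)[0][0]
```

Because the transforms break the suffix in two of the three copies, the toy model refuses on those copies and the smoothed decision is a refusal, even though the unsmoothed model is jailbroken by the verbatim prompt; benign prompts are answered unchanged.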
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.