One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
- URL: http://arxiv.org/abs/2505.07167v2
- Date: Mon, 04 Aug 2025 06:22:49 GMT
- Title: One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
- Authors: Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin
- Abstract summary: Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research. We propose D-STT, a simple yet effective defense algorithm that identifies and explicitly decodes safety trigger tokens of the given safety-aligned LLM.
- Score: 20.42976162135529
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research. However, they remain vulnerable to jailbreak attacks, which manipulate the models into generating harmful responses despite safety alignment. Recent studies have shown that current safety-aligned LLMs often undergo shallow safety alignment, where the first few tokens largely determine whether the response will be harmful. Through comprehensive observations, we find that safety-aligned LLMs and various defense strategies generate highly similar initial tokens in their refusal responses, which we define as safety trigger tokens. Building on this insight, we propose D-STT, a simple yet effective defense algorithm that identifies and explicitly decodes the safety trigger tokens of the given safety-aligned LLM to trigger the model's learned safety patterns. In this process, the safety trigger is constrained to a single token, which effectively preserves model usability by introducing minimal intervention in the decoding process. Extensive experiments across diverse jailbreak attacks and benign prompts demonstrate that D-STT significantly reduces output harmfulness while preserving model usability and incurring negligible response time overhead, outperforming ten baseline methods.
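The core decoding intervention described in the abstract can be illustrated with a short sketch: force a single safety trigger token (for example, the leading "I" of typical "I cannot ..." refusals) to be decoded first, then let the model continue unconstrained. This is a minimal illustration under assumed details, not the authors' implementation: the checkpoint name, the chosen trigger token, and the generation settings are placeholders.

```python
# Minimal sketch: explicitly decode one assumed safety trigger token, then generate freely.
# The checkpoint name, the trigger string, and all generation settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder safety-aligned model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def generate_with_trigger(prompt: str, trigger: str = "I", max_new_tokens: int = 128) -> str:
    """Append the trigger token to the prompt so it becomes the first response token."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    trigger_ids = tokenizer(trigger, add_special_tokens=False,
                            return_tensors="pt").input_ids.to(model.device)
    input_ids = torch.cat([prompt_ids, trigger_ids], dim=-1)
    attention_mask = torch.ones_like(input_ids)
    output = model.generate(input_ids, attention_mask=attention_mask,
                            max_new_tokens=max_new_tokens, do_sample=False)
    # Return the trigger token plus the model's continuation.
    return tokenizer.decode(output[0, prompt_ids.shape[-1]:], skip_special_tokens=True)
```

Under the paper's framing, a harmful prompt would tend to continue from the forced token into the model's learned refusal pattern ("I cannot ..."), while a benign prompt is only minimally constrained ("I can help with that ..."), which is how a single-token trigger can balance safety and usability.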
Related papers
- Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs). SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
- SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism [123.54980913741828]
Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning. MLLMs are susceptible to multimodal jailbreak attacks, which hinders their safe deployment. We propose Safe Prune-then-Restore (SafePTR), a training-free defense framework that selectively prunes harmful tokens at vulnerable layers while restoring benign features at subsequent layers.
arXiv Detail & Related papers (2025-07-02T09:22:03Z)
- Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. We propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z)
- STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models [31.35788474507371]
Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks. We present STShield, a lightweight framework for real-time jailbreak judgement.
arXiv Detail & Related papers (2025-03-23T04:23:07Z)
- Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs? [0.836362570897926]
We investigate existing methods for such generalization and find them insufficient. To avoid performance degradation and preserve safe performance, we advocate for a two-step framework. We find that the final hidden state for the last token is enough to provide robust performance. (A minimal probe over this last-token hidden state is sketched after this list.)
arXiv Detail & Related papers (2025-02-22T10:31:50Z)
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
- Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models [9.318094073527563]
Internal activations of large vision-language models (LVLMs) can identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term "safety heads". By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector.
arXiv Detail & Related papers (2025-01-03T07:01:15Z)
- Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level [10.476222570886483]
Large language models (LLMs) have demonstrated immense utility across various industries. As LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts. This paper examines the LLMs' capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens.
arXiv Detail & Related papers (2024-10-09T12:09:30Z)
- SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance [48.36220909956064]
SafeAligner is a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks. We develop two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses. We show that SafeAligner can increase the likelihood of beneficial tokens while reducing the occurrence of harmful ones. (A rough logit-guidance sketch in this spirit follows after this list.)
arXiv Detail & Related papers (2024-06-26T07:15:44Z)
- Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs). Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Empirical results on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
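The idea flagged in "Maybe I Should Not Answer That..." above, that the final hidden state of the last prompt token carries enough signal for a safety check, can be illustrated with a small probe. This is a hedged sketch under assumed details: the checkpoint name is a placeholder and the linear probe would have to be trained separately on labeled safe/unsafe prompts.

```python
# Sketch of a safety probe on the final hidden state of the last prompt token.
# The checkpoint is a placeholder; `probe` must be trained separately on labeled prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
probe = torch.nn.Linear(model.config.hidden_size, 2)  # class 0 = safe, class 1 = unsafe

@torch.no_grad()
def prompt_is_unsafe(prompt: str) -> bool:
    inputs = tokenizer(prompt, return_tensors="pt")
    hidden_states = model(**inputs).hidden_states[-1]  # (1, seq_len, hidden_size)
    last_token_state = hidden_states[:, -1, :]          # final hidden state of the last token
    return probe(last_token_state).argmax(dim=-1).item() == 1
```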
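SafeAligner's decoding-stage idea can similarly be sketched as a contrastive adjustment of next-token logits: boost tokens a safety-tuned "sentinel" model prefers and suppress tokens an unsafe "intruder" model prefers. The three checkpoints and the scaling factor alpha are illustrative assumptions, not the paper's exact procedure.

```python
# Rough sketch of response-disparity guidance at decoding time.
# All three checkpoint names and the value of `alpha` are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("base-llm")            # placeholder
base = AutoModelForCausalLM.from_pretrained("base-llm")          # model being defended
sentinel = AutoModelForCausalLM.from_pretrained("sentinel-llm")  # trained to foster safety
intruder = AutoModelForCausalLM.from_pretrained("intruder-llm")  # generates riskier responses

@torch.no_grad()
def guided_next_token(input_ids: torch.Tensor, alpha: float = 1.0) -> int:
    base_logits = base(input_ids).logits[:, -1, :]
    sentinel_logits = sentinel(input_ids).logits[:, -1, :]
    intruder_logits = intruder(input_ids).logits[:, -1, :]
    # Shift probability mass toward tokens the sentinel favors and away from
    # tokens the intruder favors, then pick greedily.
    guided = base_logits + alpha * (sentinel_logits - intruder_logits)
    return int(guided.argmax(dim=-1).item())
```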