Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs
- URL: http://arxiv.org/abs/2507.04365v1
- Date: Sun, 06 Jul 2025 12:19:04 GMT
- Title: Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs
- Authors: Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho
- Abstract summary: We reveal a universal phenomenon that occurs during jailbreak attacks: Attention Slipping. We show Attention Slipping is consistent across various jailbreak methods, including gradient-based token replacement, prompt-level template refinement, and in-context learning. We propose Attention Sharpening, a new defense that directly counters Attention Slipping by sharpening the attention score distribution using temperature scaling.
- Score: 61.916827858666906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) become more integral to society and technology, ensuring their safety becomes essential. Jailbreak attacks exploit vulnerabilities to bypass safety guardrails, posing a significant threat. However, the mechanisms enabling these attacks are not well understood. In this paper, we reveal a universal phenomenon that occurs during jailbreak attacks: Attention Slipping. During this phenomenon, the model gradually reduces the attention it allocates to unsafe requests in a user query during the attack process, ultimately causing a jailbreak. We show Attention Slipping is consistent across various jailbreak methods, including gradient-based token replacement, prompt-level template refinement, and in-context learning. Additionally, we evaluate two defenses based on query perturbation, Token Highlighter and SmoothLLM, and find they indirectly mitigate Attention Slipping, with their effectiveness positively correlated with the degree of mitigation achieved. Inspired by this finding, we propose Attention Sharpening, a new defense that directly counters Attention Slipping by sharpening the attention score distribution using temperature scaling. Experiments on four leading LLMs (Gemma2-9B-It, Llama3.1-8B-It, Qwen2.5-7B-It, Mistral-7B-It v0.2) show that our method effectively resists various jailbreak attacks while maintaining performance on benign tasks on AlpacaEval. Importantly, Attention Sharpening introduces no additional computational or memory overhead, making it an efficient and practical solution for real-world deployment.
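The defense described in the abstract amounts to one change in the attention computation: dividing the pre-softmax logits by a temperature below 1 so the score distribution concentrates rather than flattens. The sketch below illustrates both sides of the argument in PyTorch: a simple way to track how much attention mass lands on the unsafe span of a query (the quantity that "slips" under attack) and a temperature-scaled attention step. The function names, the span-based metric, and the temperature value of 0.8 are assumptions for illustration, not the authors' reference implementation.

```python
import math
import torch
import torch.nn.functional as F


def attention_mass_on_span(attn_weights: torch.Tensor, span: slice) -> torch.Tensor:
    """Total attention the final query position places on a span of key positions.

    `attn_weights` has shape (..., num_queries, num_keys); `span` indexes the key
    positions covering the unsafe part of the request. A drop in this quantity as
    adversarial tokens are added is the "Attention Slipping" signature described
    in the abstract (an illustrative metric, not the paper's exact definition).
    """
    return attn_weights[..., -1, span].sum(dim=-1)


def sharpened_attention(q, k, v, temperature: float = 0.8, mask=None):
    """Scaled dot-product attention with temperature-scaled (sharpened) scores.

    A temperature below 1 divides the attention logits before the softmax,
    concentrating probability mass on the highest-scoring keys. This sketches the
    "Attention Sharpening" idea; the default temperature is an assumed value.
    """
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = F.softmax(scores / temperature, dim=-1)  # T < 1 sharpens, T > 1 flattens
    return torch.matmul(attn, v), attn
```

Because only the softmax input is rescaled, such a modification adds no parameters and no extra memory, which is consistent with the abstract's claim that Attention Sharpening introduces no additional computational or memory overhead.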
Related papers
- DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification [18.006622965818856]
We introduce DETAM, a finetuning-free defense approach that improves the defensive capabilities of LLMs against jailbreak attacks. Specifically, we analyze the differences in attention scores between successful and unsuccessful defenses to identify the attention heads sensitive to jailbreak attacks. During inference, we reallocate attention to emphasize the user's core intention, minimizing interference from attack tokens.
arXiv Detail & Related papers (2025-04-18T09:02:12Z)
- Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment [5.552439217633078]
It is important to anticipate the range of potential Jailbreak attacks to guide effective defenses and accurate assessment of model safety. We present a new approach for generating highly effective Jailbreak attacks that manipulate the attention of the model to selectively strengthen or weaken attention among different parts of the prompt.
arXiv Detail & Related papers (2025-02-21T09:38:00Z)
- xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models. We propose a novel black-box jailbreak method leveraging reinforcement learning (RL). We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z)
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
- Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [55.253208152184065]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs into generating harmful text. We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples. We propose a novel defense called Activation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z)
- The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
The vulnerability of Vision Large Language Models (VLLMs) to jailbreak attacks comes as no surprise. Recent defense mechanisms against these attacks have reached near-saturation performance on benchmark evaluations.
arXiv Detail & Related papers (2024-11-13T07:57:19Z)
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
- MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks [2.873719680183099]
This paper advocates for the significance of jailbreak attack prevention on Large Language Models (LLMs).
We introduce MoJE, a novel guardrail architecture designed to surpass current limitations in existing state-of-the-art guardrails.
MoJE excels in detecting jailbreak attacks while maintaining minimal computational overhead during model inference.
arXiv Detail & Related papers (2024-09-26T10:12:19Z)
- Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs). Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.