Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models
- URL: http://arxiv.org/abs/2501.02029v1
- Date: Fri, 03 Jan 2025 07:01:15 GMT
- Title: Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models
- Authors: Ziwei Zheng, Junyao Zhao, Le Yang, Lijun He, Fan Li,
- Abstract summary: Internal activations of large vision-language models (LVLMs) can identify malicious prompts across different attacks.
This inherent safety perception is governed by sparse attention heads, which we term safety heads"
By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector.
- Score: 9.318094073527563
- License:
- Abstract: With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term ``safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at \url{https://github.com/Ziwei-Zheng/SAHs}.
Related papers
- HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States [17.601328965546617]
We investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference.
Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts.
We introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety.
arXiv Detail & Related papers (2025-02-20T17:14:34Z) - Internal Activation as the Polar Star for Steering Unsafe LLM Behavior [50.463399903987245]
We introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states.
Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility.
arXiv Detail & Related papers (2025-02-03T04:23:33Z) - Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation [4.241100280846233]
AI agents, powered by large language models (LLMs), have transformed human-computer interactions by enabling seamless, natural, and context-aware communication.
This paper investigates a critical vulnerability: adversarial attacks targeting the LLM core within AI agents.
arXiv Detail & Related papers (2024-12-05T18:38:30Z) - SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large Language Models (LLMs) to defend threats from malicious instructions.
Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue.
We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z) - Purple-teaming LLMs with Adversarial Defender Training [57.535241000787416]
We present Purple-teaming LLMs with Adversarial Defender training (PAD)
PAD is a pipeline designed to safeguard LLMs by novelly incorporating the red-teaming (attack) and blue-teaming (safety training) techniques.
PAD significantly outperforms existing baselines in both finding effective attacks and establishing a robust safe guardrail.
arXiv Detail & Related papers (2024-07-01T23:25:30Z) - SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance [48.36220909956064]
SafeAligner is a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks.
We develop two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses.
We show that SafeAligner can increase the likelihood of beneficial tokens, while reducing the occurrence of harmful ones.
arXiv Detail & Related papers (2024-06-26T07:15:44Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z) - Uncovering Safety Risks of Large Language Models through Concept Activation Vector [13.804245297233454]
We introduce a Safety Concept Activation Vector (SCAV) framework to guide attacks on large language models (LLMs)
We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks.
Our attack method significantly improves the attack success rate and response quality while requiring less training data.
arXiv Detail & Related papers (2024-04-18T09:46:25Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities.
Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content.
We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment [31.24530091590395]
We study an attack scenario called Trojan Activation Attack (TA2), which injects trojan steering vectors into the activation layers of Large Language Models.
Our experiment results show that TA2 is highly effective and adds little or no overhead to attack efficiency.
arXiv Detail & Related papers (2023-11-15T23:07:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.