HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States
- URL: http://arxiv.org/abs/2502.14744v1
- Date: Thu, 20 Feb 2025 17:14:34 GMT
- Title: HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States
- Authors: Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, Xiangyu Yue,
- Abstract summary: We investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference.
Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts.
We introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety.
- Score: 17.601328965546617
- License:
- Abstract: The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focuses on post-hoc alignment techniques, the underlying safety mechanisms within LVLMs remain largely unexplored. In this work , we investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts, which can be leveraged to detect and mitigate adversarial inputs without requiring extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety. Experimental results show that {HiddenDetect} surpasses state-of-the-art methods in detecting jailbreak attacks against LVLMs. By utilizing intrinsic safety-aware patterns, our method provides an efficient and scalable solution for strengthening LVLM robustness against multimodal threats. Our code will be released publicly at https://github.com/leigest519/HiddenDetect.
Related papers
- Understanding and Rectifying Safety Perception Distortion in VLMs [19.239094089025095]
Vision-language models (VLMs) become more susceptible to harmful requests and jailbreak attacks after integrating the vision modality.
multimodal inputs introduce an modality-induced activation shift toward a "safer" direction compared to their text-only counterparts.
We propose ShiftDC, a training-free method that decomposes and calibrates the modality-induced activation shift to reduce the impact of modality on safety.
arXiv Detail & Related papers (2025-02-18T18:06:48Z) - Internal Activation as the Polar Star for Steering Unsafe LLM Behavior [50.463399903987245]
We introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states.
Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility.
arXiv Detail & Related papers (2025-02-03T04:23:33Z) - Internal Activation Revision: Safeguarding Vision Language Models Without Parameter Update [8.739132798784777]
Vision-language models (VLMs) demonstrate strong multimodal capabilities but have been found to be more susceptible to generating harmful content.
We propose an textbfinternal activation revision approach that efficiently revises activations during generation.
Our framework incorporates revisions at both the layer and head levels, offering control over the model's generation at varying levels of granularity.
arXiv Detail & Related papers (2025-01-24T06:17:22Z) - Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models [9.318094073527563]
Internal activations of large vision-language models (LVLMs) can identify malicious prompts across different attacks.
This inherent safety perception is governed by sparse attention heads, which we term safety heads"
By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector.
arXiv Detail & Related papers (2025-01-03T07:01:15Z) - Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks.
We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction.
We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z) - CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.36429361299807]
multimodal large language models (MLLMs) have demonstrated remarkable success in engaging in conversations involving visual inputs.
The integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs.
We introduce a technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution.
arXiv Detail & Related papers (2024-09-17T17:14:41Z) - SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large Language Models (LLMs) to defend threats from malicious instructions.
Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue.
We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z) - A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends [78.3201480023907]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal understanding and reasoning tasks.
The vulnerability of LVLMs is relatively underexplored, posing potential security risks in daily usage.
In this paper, we provide a comprehensive review of the various forms of existing LVLM attacks.
arXiv Detail & Related papers (2024-07-10T06:57:58Z) - Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats.
This paper presents an innovative defensive strategy, given white box access to an LLM.
We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
arXiv Detail & Related papers (2024-06-05T13:06:33Z) - Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes [0.0]
Large Language Models (LLMs) have gained widespread adoption across various domains, including chatbots and auto-task completion agents.
These models are susceptible to safety vulnerabilities such as jailbreaking, prompt injection, and privacy leakage attacks.
This study investigates the impact of these modifications on LLM safety, a critical consideration for building reliable and secure AI systems.
arXiv Detail & Related papers (2024-04-05T20:31:45Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities.
Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content.
We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.