Internal Activation Revision: Safeguarding Vision Language Models Without Parameter Update
- URL: http://arxiv.org/abs/2501.16378v1
- Date: Fri, 24 Jan 2025 06:17:22 GMT
- Title: Internal Activation Revision: Safeguarding Vision Language Models Without Parameter Update
- Authors: Qing Li, Jiahui Geng, Zongxiong Chen, Kun Song, Lei Ma, Fakhri Karray
- Abstract summary: Vision-language models (VLMs) demonstrate strong multimodal capabilities but have been found to be more susceptible to generating harmful content. We propose an internal activation revision approach that efficiently revises activations during generation. Our framework incorporates revisions at both the layer and head levels, offering control over the model's generation at varying levels of granularity.
- Score: 8.739132798784777
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Vision-language models (VLMs) demonstrate strong multimodal capabilities but have been found to be more susceptible to generating harmful content compared to their backbone large language models (LLMs). Our investigation reveals that the integration of images significantly shifts the model's internal activations during the forward pass, diverging from those triggered by textual input. Moreover, the safety alignment of the LLMs embedded within VLMs is not sufficiently robust to handle the activation discrepancies, making the models vulnerable to even the simplest jailbreaking attacks. To address this issue, we propose an internal activation revision approach that efficiently revises activations during generation, steering the model toward safer outputs. Our framework incorporates revisions at both the layer and head levels, offering control over the model's generation at varying levels of granularity. In addition, we explore three strategies for constructing positive and negative samples and two approaches for extracting revision vectors, resulting in different variants of our method. Comprehensive experiments demonstrate that the internal activation revision method significantly improves the safety of widely used VLMs, reducing attack success rates by an average of 48.94%, 34.34%, 43.92%, and 52.98% on SafeBench, Safe-Unsafe, Unsafe, and MM-SafetyBench, respectively, while minimally impacting model helpfulness.
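For intuition, the sketch below shows layer-level activation steering in the spirit of the approach the abstract describes: a per-layer revision vector is estimated as the mean activation difference between positive (safe) and negative (unsafe) examples, then added to the hidden states of selected layers at generation time. This is a minimal illustration for a generic Hugging Face decoder, not the authors' implementation; the hook mechanics, the `model.model.layers` attribute path, and the scaling factor `alpha` are assumptions.

```python
# Minimal sketch of layer-level activation revision (illustrative, not the
# authors' released code). A revision vector per layer is the mean difference
# between activations on positive (safe) and negative (unsafe) examples; it is
# added to that layer's hidden states at every generation step via a hook.
import torch


@torch.no_grad()
def compute_revision_vectors(model, tokenizer, positive_prompts, negative_prompts, layers):
    """Estimate one revision vector per selected layer from last-token hidden states."""
    def mean_last_token_states(prompts):
        collected = {l: [] for l in layers}
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            out = model(**inputs, output_hidden_states=True)
            for l in layers:
                collected[l].append(out.hidden_states[l][0, -1, :])
        return {l: torch.stack(v).mean(dim=0) for l, v in collected.items()}

    pos = mean_last_token_states(positive_prompts)
    neg = mean_last_token_states(negative_prompts)
    return {l: pos[l] - neg[l] for l in layers}


def register_revision_hooks(model, revision_vectors, alpha=4.0):
    """Add alpha * revision_vector to the output of each selected decoder layer."""
    handles = []
    for layer_idx, vec in revision_vectors.items():
        block = model.model.layers[layer_idx]  # attribute path is architecture-dependent

        def hook(module, inputs, output, vec=vec):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + alpha * vec.to(hidden.dtype)
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

        handles.append(block.register_forward_hook(hook))
    return handles  # call handle.remove() on each to restore the original model
```

Head-level revision would apply the same idea to individual attention-head outputs instead of whole layers, giving finer-grained control; the positive/negative sample construction and vector extraction strategies mentioned in the abstract would change how `compute_revision_vectors` builds its examples, not the steering step itself.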
Related papers
- T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text-to-video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content.
We propose T2VShield, a comprehensive and model-agnostic defense framework designed to protect text-to-video models from jailbreak threats.
Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z) - Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited.
We propose finetuning MLLMs on a small set of benign instruction-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z) - Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models [34.66687625996389]
Multimodal large language models (MLLMs) are critical for developing general-purpose AI assistants, yet they face growing safety risks.
How can we ensure that MLLMs are safely aligned to prevent undesired behaviors such as discrimination, misinformation, or violations of ethical standards?
We propose Safe RLHF-V, the first multimodal safety alignment framework that jointly optimizes helpfulness and safety.
arXiv Detail & Related papers (2025-03-22T07:40:20Z) - HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States [17.601328965546617]
We investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference.
Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts.
We introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety.
arXiv Detail & Related papers (2025-02-20T17:14:34Z) - Understanding and Rectifying Safety Perception Distortion in VLMs [19.239094089025095]
Vision-language models (VLMs) become more susceptible to harmful requests and jailbreak attacks after integrating the vision modality.
Multimodal inputs introduce a modality-induced activation shift toward a "safer" direction compared to their text-only counterparts.
We propose ShiftDC, a training-free method that decomposes and calibrates the modality-induced activation shift to reduce the impact of modality on safety.
arXiv Detail & Related papers (2025-02-18T18:06:48Z) - Model-Editing-Based Jailbreak against Safety-aligned Large Language Models [13.887770576598646]
Large Language Models (LLMs) have transformed numerous fields by enabling advanced natural language interactions.
This paper presents Targeted Model Editing (TME), a novel white-box approach that bypasses safety filters.
TME identifies and removes safety-critical transformations (SCTs) embedded in model matrices, enabling malicious queries to bypass restrictions.
arXiv Detail & Related papers (2024-12-11T08:44:15Z) - Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models [92.79804303337522]
Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues.
We introduce MLAI, a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment.
Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2.
arXiv Detail & Related papers (2024-11-27T02:40:29Z) - ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2.
These models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z) - MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z) - Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats.
This paper presents an innovative defensive strategy, given white-box access to an LLM.
We apply a novel methodology for analyzing distinctive activation patterns in the residual streams to classify attack prompts (a minimal probe in this spirit is sketched after this list).
arXiv Detail & Related papers (2024-06-05T13:06:33Z) - RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
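Several of the entries above (HiddenDetect and the residual-stream activation analysis) detect jailbreaks by reading the model's internal activations rather than its text output. A minimal sketch of that general idea, assuming a Hugging Face causal LM and scikit-learn, is a linear probe on one layer's last-token hidden state; the layer index and the probe choice are illustrative assumptions, not either paper's method.

```python
# Illustrative safety probe in the spirit of activation-monitoring defenses
# (not the released code of either paper): classify prompts as benign vs.
# jailbreak from one layer's last-token hidden state with a linear probe.
import torch
from sklearn.linear_model import LogisticRegression


@torch.no_grad()
def last_token_activation(model, tokenizer, prompt, layer_idx):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer_idx][0, -1, :].float().cpu().numpy()


def fit_safety_probe(model, tokenizer, prompts, labels, layer_idx=16):
    """labels: 1 for jailbreak/unsafe prompts, 0 for benign ones."""
    feats = [last_token_activation(model, tokenizer, p, layer_idx) for p in prompts]
    return LogisticRegression(max_iter=1000).fit(feats, labels)
```

At inference time, `probe.predict_proba` on a new prompt's activation yields a jailbreak score that can gate or refuse generation; the mid-depth layer index of 16 is an arbitrary choice and would need tuning per model.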
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.