VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap
- URL: http://arxiv.org/abs/2502.10486v1
- Date: Fri, 14 Feb 2025 08:44:43 GMT
- Title: VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap
- Authors: Qin Liu, Fei Wang, Chaowei Xiao, Muhao Chen
- Abstract summary: Vision language models (VLMs) come with increased safety concerns.
VLMs can be built upon LLMs with textual safety alignment, but this alignment is easily undermined once the vision modality is integrated.
We propose VLM-Guard, an inference-time intervention strategy that leverages the LLM component of a VLM as supervision for the safety alignment of the VLM.
- Score: 51.287157951953226
- Abstract: The emergence of vision language models (VLMs) comes with increased safety concerns, as the incorporation of multiple modalities heightens vulnerability to attacks. Although VLMs can be built upon LLMs that have textual safety alignment, this alignment is easily undermined when the vision modality is integrated. We attribute this safety challenge to the modality gap, a separation of image and text in the shared representation space, which blurs the distinction between harmful and harmless queries that is evident in LLMs but weakened in VLMs. To avoid safety decay and fulfill the safety alignment gap, we propose VLM-Guard, an inference-time intervention strategy that leverages the LLM component of a VLM as supervision for the safety alignment of the VLM. VLM-Guard projects the VLM's representations into the subspace orthogonal to the safety steering direction extracted from the safety-aligned LLM. Experimental results on three malicious instruction settings show the effectiveness of VLM-Guard in safeguarding the VLM and fulfilling the safety alignment gap between the VLM and its LLM component.
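The abstract only states that VLM representations are projected onto the subspace orthogonal to a safety steering direction taken from the LLM component. Below is a minimal sketch of that projection step, assuming the direction is obtained as a unit-norm difference of mean LLM hidden states on harmful vs. harmless text prompts and applied through a forward hook on a decoder layer; the function names (`extract_steering_direction`, `project_out`, `attach_guard`) and the extraction recipe are assumptions for illustration, not the paper's exact method.

```python
# Sketch of an orthogonal-projection intervention at inference time.
# The difference-of-means extraction and the hook placement are assumptions;
# the paper's actual recipe may differ.
import torch

def extract_steering_direction(harmful_acts: torch.Tensor,
                               harmless_acts: torch.Tensor) -> torch.Tensor:
    """Assumed recipe: unit-norm difference of mean LLM hidden states
    collected on harmful vs. harmless text-only prompts."""
    v = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return v / v.norm()

def project_out(hidden: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Project hidden states onto the subspace orthogonal to v by
    removing their component along the steering direction."""
    return hidden - (hidden @ v).unsqueeze(-1) * v

def attach_guard(decoder_layer: torch.nn.Module, v: torch.Tensor):
    """Apply the projection to the layer's output at inference time."""
    def hook(module, inputs, output):
        hs = output[0] if isinstance(output, tuple) else output
        hs = project_out(hs, v.to(device=hs.device, dtype=hs.dtype))
        return (hs,) + output[1:] if isinstance(output, tuple) else hs
    return decoder_layer.register_forward_hook(hook)
```

In this sketch the steering direction would be computed once offline from the safety-aligned LLM, and the returned hook handle can be detached with `handle.remove()` when the intervention is not needed.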
Related papers
- Large Language Model Supply Chain: Open Problems From the Security Perspective [25.320736806895976]
Large Language Models (LLMs) are changing the software development paradigm and have gained huge attention from both academia and industry.
We take the first step toward discussing the potential security risks in each component of the LLM supply chain (LLM SC), as well as in the integration between components.
arXiv Detail & Related papers (2024-11-03T15:20:21Z)
- Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models [26.83278034227966]
The safety alignment ability of Vision-Language Models (VLMs) is prone to be degraded by the integration of the vision module.
We show that the challenge arises from the representation gap that emerges when the vision modality is introduced into VLMs.
To reduce safety alignment degradation, we introduce Cross-Modality Representation Manipulation (CMRM).
arXiv Detail & Related papers (2024-10-11T17:59:31Z)
- CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.36429361299807]
Multimodal large language models (MLLMs) have demonstrated remarkable success in engaging in conversations involving visual inputs.
The integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs.
We introduce a technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution.
arXiv Detail & Related papers (2024-09-17T17:14:41Z)
- Purple-teaming LLMs with Adversarial Defender Training [57.535241000787416]
We present Purple-teaming LLMs with Adversarial Defender training (PAD).
PAD is a pipeline designed to safeguard LLMs by jointly incorporating red-teaming (attack) and blue-teaming (safety training) techniques.
PAD significantly outperforms existing baselines in both finding effective attacks and establishing a robust safe guardrail.
arXiv Detail & Related papers (2024-07-01T23:25:30Z)
- Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks [41.213482317141356]
Augmenting Large Language Models with image-understanding capabilities has resulted in a boom of high-performing Vision-Language Models (VLMs).
In this paper, we explore the impact of jailbreaking on three state-of-the-art VLMs, each using a distinct modeling approach.
arXiv Detail & Related papers (2024-05-07T15:29:48Z)
- Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [98.02846901473697]
We propose ECSO (Eyes Closed, Safety On), a training-free protection approach that exploits the inherent safety awareness of MLLMs.
ECSO generates safer responses via adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of pre-aligned LLMs.
arXiv Detail & Related papers (2024-03-14T17:03:04Z)
- ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors [90.73444232283371]
ShieldLM is a safety detector for Large Language Models (LLMs) that aligns with common safety standards.
We show that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability.
arXiv Detail & Related papers (2024-02-26T09:43:02Z)
- MART: Improving LLM Safety with Multi-round Automatic Red-Teaming [72.2127916030909]
We propose a Multi-round Automatic Red-Teaming (MART) method, which incorporates both automatic adversarial prompt writing and safe response generation.
On adversarial prompt benchmarks, the violation rate of an LLM with limited safety alignment decreases by up to 84.7% after 4 rounds of MART.
Notably, model helpfulness on non-adversarial prompts remains stable throughout iterations, indicating the target LLM maintains strong performance on instruction following.
arXiv Detail & Related papers (2023-11-13T19:13:29Z)