Related papers: Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap

Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap

URL: http://arxiv.org/abs/2505.24208v1
Date: Fri, 30 May 2025 04:40:08 GMT
Title: Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap
Authors: Wenhan Yang, Spencer Stice, Ali Payani, Baharan Mirzasoleiman,
Abstract summary: We show that the amount of modality gap is highly inversely correlated with Vision-Language Models' safety.<n>Inspired by this observation, we propose a regularization to reduce the modality gap during pretraining.<n>Our method substantially improves safety alignment of LVLMs, reducing unsafe rate by up to 16.3% without compromising performance, and can further boost existing defenses by up to 18.2%.
Score: 43.31975448647118
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Ensuring Vision-Language Models (VLMs) generate safe outputs is crucial for their reliable deployment. However, LVLMs suffer from drastic safety degradation compared to their LLM backbone. Even blank or irrelevant images can trigger LVLMs to generate harmful responses to prompts that would otherwise be refused in text-only contexts. The modality gap between image and text representations has been recently hypothesized to contribute to safety degradation of LVLMs. However, if and how the amount of modality gap affects LVLMs' safety is not studied. In this work, we show that the amount of modality gap is highly inversely correlated with VLMs' safety. Then, we show that this modality gap is introduced during pretraining LVLMs and persists through fine-tuning. Inspired by this observation, we propose a regularization to reduce the modality gap during pretraining. Our extensive experiments on LLaVA v1.5, ShareGPT4V, and MiniGPT-4 show that our method substantially improves safety alignment of LVLMs, reducing unsafe rate by up to 16.3% without compromising performance, and can further boost existing defenses by up to 18.2%.

Related papers

LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning [61.594212398272184]
Low-Rank Extrapolation (LoX) improves robustness against benign and malicious fine-tuning attacks.<n>LoX leads to 11% to 54% absolute reductions in attack success rates.
arXiv Detail & Related papers (2025-06-18T16:30:02Z)
HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model [58.12612140992874]
We introduce a holistic safety dataset and benchmark, textbfHoliSafe, that spans all five safe/unsafe image-text combinations.<n>We also propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images.<n> Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks.
arXiv Detail & Related papers (2025-06-05T07:26:34Z)
Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models [50.89022445197919]
Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs)<n>Recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety-alignment.
arXiv Detail & Related papers (2025-05-26T08:25:25Z)
Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-tuning [23.71517734919702]
Vision-language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs.<n>Current alignment strategies rely on supervised safety fine-tuning with curated datasets.<n>We show that supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses.
arXiv Detail & Related papers (2025-03-14T19:52:08Z)
Understanding and Rectifying Safety Perception Distortion in VLMs [19.239094089025095]
Vision-language models (VLMs) become more susceptible to harmful requests and jailbreak attacks after integrating the vision modality.<n> multimodal inputs introduce an modality-induced activation shift toward a "safer" direction compared to their text-only counterparts.<n>We propose ShiftDC, a training-free method that decomposes and calibrates the modality-induced activation shift to reduce the impact of modality on safety.
arXiv Detail & Related papers (2025-02-18T18:06:48Z)
VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap [51.287157951953226]
Vision language models (VLMs) come with increased safety concerns.<n>VLMs can be built upon LLMs that have textual safety alignment, but it is easily undermined when the vision modality is integrated.<n>We propose VLM-Guard, an inference-time intervention strategy that leverages the LLM component of a VLM as supervision for the safety alignment of the VLM.
arXiv Detail & Related papers (2025-02-14T08:44:43Z)
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models [26.83278034227966]
The safety alignment ability of Vision-Language Models (VLMs) is prone to be degraded by the integration of the vision module. We show that the challenge arises from the representation gap that emerges when introducing vision modality to VLMs. To reduce safety alignment degradation, we introduce Cross-Modality Representation Manipulation (CMRM)
arXiv Detail & Related papers (2024-10-11T17:59:31Z)
CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.36429361299807]
multimodal large language models (MLLMs) have demonstrated remarkable success in engaging in conversations involving visual inputs. The integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs. We introduce a technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution.
arXiv Detail & Related papers (2024-09-17T17:14:41Z)
Uncovering Safety Risks of Large Language Models through Concept Activation Vector [13.804245297233454]
We introduce a Safety Concept Activation Vector (SCAV) framework to guide attacks on large language models (LLMs)<n>We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks.<n>Our attack method significantly improves the attack success rate and response quality while requiring less training data.
arXiv Detail & Related papers (2024-04-18T09:46:25Z)
Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [98.02846901473697]
We propose ECSO (Eyes Closed, Safety On), a training-free protecting approach that exploits the inherent safety awareness of MLLMs. ECSO generates safer responses via adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of pre-aligned LLMs.
arXiv Detail & Related papers (2024-03-14T17:03:04Z)
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models [39.56233272612982]
Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to jailbreaking attacks. Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning. To address this issue, we first curate a vision-language safe instruction-following dataset VLGuard covering various harmful categories.
arXiv Detail & Related papers (2024-02-03T16:43:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.