HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model
- URL: http://arxiv.org/abs/2506.04704v4
- Date: Thu, 06 Nov 2025 15:28:19 GMT
- Title: HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model
- Authors: Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilcahe Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
- Abstract summary: We introduce a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations. We also propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images. Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks.
- Score: 58.12612140992874
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation (HoliSafe-Bench). We further propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images for VLMs. This module endows VLMs with a dual functionality: they not only learn to generate safer responses but can also provide an interpretable harmfulness classification to justify their refusal decisions. A significant advantage of this approach is its modularity; the VGM is designed as a plug-in component, allowing for seamless integration with diverse pre-trained VLMs across various scales. Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe-Bench itself reveals critical vulnerabilities in existing VLM models. We hope that HoliSafe and VGM will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.
Related papers
- Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities [23.165174248333212]
Vision-language models (VLMs) are increasingly applied to identify unsafe or inappropriate images. It is still unclear whether they can recognize various unsafe concepts when presented in different modalities, such as text and images. We conduct a systematic evaluation of VLMs' perception (concept recognition) and alignment (ethical reasoning) capabilities. We introduce a simplified reinforcement learning (RL)-based approach using proximal policy optimization (PPO) to strengthen the ability to identify unsafe concepts from images.
arXiv Detail & Related papers (2025-07-15T10:04:27Z)
- The Safety Reminder: A Soft Prompt to Reactivate Delayed Safety Awareness in Vision-Language Models [4.27794555931853]
Vision-Language Models (VLMs) face unique vulnerabilities due to their multimodal nature, allowing adversaries to bypass safety guardrails and trigger the generation of harmful content. We propose "The Safety Reminder", a soft prompt tuning approach that optimizes learnable prompt tokens, which are periodically injected during the text generation process to enhance safety awareness.
arXiv Detail & Related papers (2025-06-15T12:48:38Z)
- Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
- SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning [76.56522719330911]
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering. LRMs pose great safety risks against harmful queries and adversarial attacks. We propose SafeKey to better activate the safety aha moment in the key sentence.
arXiv Detail & Related papers (2025-05-22T03:46:03Z)
- SafeVid: Toward Safety Aligned Video Large Multimodal Models [60.14535756294228]
We introduce SafeVid, a framework designed to instill video-specific safety principles in Video Large Multimodal Models (VLMMs). SafeVid employs detailed textual video descriptions as an interpretive bridge, facilitating rule-driven safety reasoning. Alignment with SafeVid-350K significantly enhances VLMM safety, with models like LLaVA-NeXT-Video demonstrating substantial improvements.
arXiv Detail & Related papers (2025-05-17T09:21:33Z)
- Safe Vision-Language Models via Unsafe Weights Manipulation [75.04426753720551]
We revise safety evaluation by introducing Safe-Ground, a new set of metrics that evaluate safety at different levels of granularity. We take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM). UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter.
arXiv Detail & Related papers (2025-03-14T17:00:22Z)
- Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs [56.440345471966666]
Multimodal Large Language Models (MLLMs) have expanded the capabilities of traditional language models by enabling interaction through both text and images. This paper introduces MMSafeAware, the first comprehensive multimodal safety awareness benchmark designed to evaluate MLLMs across 29 safety scenarios. MMSafeAware includes both unsafe and over-safety subsets to assess models' abilities to correctly identify unsafe content and avoid over-sensitivity that can hinder helpfulness.
arXiv Detail & Related papers (2025-02-16T16:12:40Z)
- Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing [12.986006070964772]
Safety alignment is an essential research topic for real-world AI applications. Our study first identified the difficulty of eliminating such vulnerabilities without sacrificing the model's helpfulness. Our method could enhance the model's helpfulness while maintaining safety, thus improving the trade-off-front.
arXiv Detail & Related papers (2025-02-04T09:31:54Z)
- Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models [25.606641582511106]
We propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Our experiments demonstrate that fine-tuning InternVL2.5-8B with MIS significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks.
arXiv Detail & Related papers (2025-01-30T17:59:45Z)
- Retention Score: Quantifying Jailbreak Risks for Vision Language Models [60.48306899271866]
Vision-Language Models (VLMs) are integrated with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities. This paper aims to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs. To evaluate a VLM's ability to maintain its robustness against adversarial input perturbations, we propose a novel metric called the Retention Score.
arXiv Detail & Related papers (2024-12-23T13:05:51Z)
- PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment [28.008884416277954]
We propose a progressive concept-based alignment strategy, PSA-VLM, to enhance visual modality safety alignment. Our method achieves state-of-the-art results on popular VLM safety benchmark.
arXiv Detail & Related papers (2024-11-18T13:01:57Z)
- How Does Vision-Language Adaptation Impact the Safety of Vision Language Models? [27.46416187893547]
Vision-Language adaptation (VL adaptation) transforms Large Language Models (LLMs) into Large Vision-Language Models (LVLMs).
Despite potential harmfulness due to weakened safety measures, in-depth analysis on the effects of VL adaptation on safety remains under-explored.
arXiv Detail & Related papers (2024-10-10T03:12:03Z)
- Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety. For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context. We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
arXiv Detail & Related papers (2024-10-08T16:16:07Z)
- Safety Alignment for Vision Language Models [21.441662865727448]
We enhance the visual modality safety alignment of Vision Language Models (VLMs) by adding safety modules.
Our method boasts ease of use, high flexibility, and strong controllability, and it enhances safety while having minimal impact on the model's general performance.
arXiv Detail & Related papers (2024-05-22T12:21:27Z)
- Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models [39.56233272612982]
Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to jailbreaking attacks.
Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning.
To address this issue, we first curate a vision-language safe instruction-following dataset VLGuard covering various harmful categories.
arXiv Detail & Related papers (2024-02-03T16:43:42Z)
- How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs [55.91371032213854]
This work focuses on the potential of Vision LLMs (VLLMs) in visual reasoning.
We introduce a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization and adversarial robustness.
arXiv Detail & Related papers (2023-11-27T18:59:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.