Towards Policy-Adaptive Image Guardrail: Benchmark and Method
- URL: http://arxiv.org/abs/2603.01228v1
- Date: Sun, 01 Mar 2026 18:59:21 GMT
- Title: Towards Policy-Adaptive Image Guardrail: Benchmark and Method
- Authors: Caiyong Piao, Zhiyuan Yan, Haoming Xu, Yunzhen Zhao, Kaiqing Lin, Feiyang Xu, Shuigeng Zhou,
- Abstract summary: Vision-language models (VLMs) offer a more adaptable and generalizable foundation for dynamic safety guardrails. Existing VLM-based safeguarding methods are typically trained and evaluated under only a fixed safety policy. We introduce SafeGuard-VL, a reinforcement learning-based method with verifiable rewards for robust unsafe-image guardrails.
- Score: 21.041111216560545
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Accurate rejection of sensitive or harmful visual content, i.e., harmful image guardrail, is critical in many application scenarios. This task must continuously adapt to the evolving safety policies and content across various domains and over time. However, traditional classifiers, confined to fixed categories, require frequent retraining when new policies are introduced. Vision-language models (VLMs) offer a more adaptable and generalizable foundation for dynamic safety guardrails. Despite this potential, existing VLM-based safeguarding methods are typically trained and evaluated under only a fixed safety policy. We find that these models are heavily overfitted to the seen policy, fail to generalize to unseen policies, and even lose the basic instruction-following ability and general knowledge. To address this issue, in this paper we make two key contributions. First, we benchmark the cross-policy generalization performance of existing VLMs with SafeEditBench, a new evaluation suite. SafeEditBench leverages image-editing models to convert unsafe images into safe counterparts, producing policy-aligned datasets where each safe-unsafe image pair remains visually similar except for localized regions violating specific safety rules. Human annotators then provide accurate safe/unsafe labels under five distinct policies, enabling fine-grained assessment of policy-aware generalization. Second, we introduce SafeGuard-VL, a reinforcement learning-based method with verifiable rewards (RLVR) for robust unsafe-image guardrails. Instead of relying solely on supervised fine-tuning (SFT) under fixed policies, SafeGuard-VL explicitly optimizes the model with policy-grounded rewards, promoting verifiable adaptation across evolving policies. Extensive experiments verify the effectiveness of our method for unsafe image guardrails across various policies.
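The abstract describes the RLVR training signal only at a high level. As a rough illustration of what a policy-grounded, automatically verifiable reward could look like, the Python sketch below scores a guardrail completion against a policy-conditioned label; the `Sample` structure, the `<verdict>`/`<rule>` tag format, and the weights are assumptions for illustration, not the paper's actual implementation.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Sample:
    """One guardrail example: the policy shown to the model and the human label."""
    policy_rules: Dict[str, str]          # rule id -> rule text, e.g. "R2": "No depictions of illegal drugs"
    label: str                            # ground-truth verdict under this policy: "safe" or "unsafe"
    violated_rule: Optional[str] = None   # annotated rule id the image violates, if any


def verifiable_reward(completion: str, sample: Sample) -> float:
    """Score a completion with rule-checkable components (illustrative weights).

    +0.2  output follows the expected <verdict>...</verdict> format
    +0.6  verdict matches the ground-truth label under the shown policy
    +0.2  the cited rule id matches the annotation (trivially granted for safe images)
    """
    reward = 0.0

    verdict_match = re.search(r"<verdict>\s*(safe|unsafe)\s*</verdict>", completion, re.I)
    if verdict_match is None:
        return 0.0  # unparseable output earns nothing
    reward += 0.2

    if verdict_match.group(1).lower() == sample.label:
        reward += 0.6

    if sample.label == "unsafe" and sample.violated_rule is not None:
        cited = re.search(r"<rule>\s*(\w+)\s*</rule>", completion)
        if cited is not None and cited.group(1) == sample.violated_rule:
            reward += 0.2
    elif sample.label == "safe":
        reward += 0.2

    return reward


if __name__ == "__main__":
    sample = Sample(
        policy_rules={"R1": "No graphic violence", "R2": "No depictions of illegal drugs"},
        label="unsafe",
        violated_rule="R2",
    )
    good = "<think>Drug paraphernalia is visible, violating R2.</think><verdict>unsafe</verdict><rule>R2</rule>"
    bad = "<verdict>safe</verdict>"
    print(verifiable_reward(good, sample))  # 1.0
    print(verifiable_reward(bad, sample))   # 0.2
```

In a PPO- or GRPO-style loop, a reward of this form would score sampled completions for an image-plus-policy prompt, so the same model can be optimized against whatever policy text is shown at training time rather than against a single fixed category set.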
Related papers
- Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought [5.251527748612469]
Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies. We present PACT (Prompt-Thought Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning.
arXiv Detail & Related papers (2026-02-06T12:20:01Z) - SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge [51.634837361795434]
SafeR-CLIP reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods. We also contribute NSFW-Caps, a new benchmark of 1,000 highly-aligned pairs for testing safety under distributional shift.
arXiv Detail & Related papers (2025-11-20T19:00:15Z) - SafeVision: Efficient Image Guardrail with Robust Policy Adherence and Explainability [49.074914896839466]
We introduce SafeVision, a novel image guardrail that integrates human-like reasoning to enhance adaptability and transparency. Our approach incorporates an effective data collection and generation framework, a policy-following training pipeline, and a customized loss function. We show that SafeVision achieves state-of-the-art performance on different benchmarks.
arXiv Detail & Related papers (2025-10-28T00:35:59Z) - SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation [5.313750874857107]
We introduce SafetyPairs, a framework for generating counterfactual pairs of images that differ only in the features relevant to the given safety policy. Using SafetyPairs, we construct a new safety benchmark, which serves as a powerful source of evaluation data. We release a benchmark containing over 3,020 SafetyPair images spanning a diverse taxonomy of 9 safety categories.
arXiv Detail & Related papers (2025-10-24T03:19:48Z) - Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities [23.165174248333212]
Vision-language models (VLMs) are increasingly applied to identify unsafe or inappropriate images. It is still unclear whether they can recognize various unsafe concepts when presented in different modalities, such as text and images. We conduct a systematic evaluation of VLMs' perception (concept recognition) and alignment (ethical reasoning) capabilities. We introduce a simplified reinforcement learning (RL)-based approach using proximal policy optimization (PPO) to strengthen the ability to identify unsafe concepts from images.
arXiv Detail & Related papers (2025-07-15T10:04:27Z) - GuardSet-X: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset [18.306944278068638]
We introduce GuardSet-X, the first massive multi-domain safety policy-grounded guardrail dataset. GuardSet-X offers broad domain coverage across eight safety-critical domains, such as finance, law, and codeGen. We benchmark 19 advanced guardrail models and uncover a series of findings.
arXiv Detail & Related papers (2025-06-18T01:35:33Z) - HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model [58.12612140992874]
We introduce a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations. We also propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images. Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks.
arXiv Detail & Related papers (2025-06-05T07:26:34Z) - Shape it Up! Restoring LLM Safety during Finetuning [65.75757313781104]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, which robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z) - Safe Vision-Language Models via Unsafe Weights Manipulation [75.04426753720551]
We revise safety evaluation by introducing Safe-Ground, a new set of metrics that evaluate safety at different levels of granularity. We take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM). UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter (a minimal sketch of this activation-comparison idea appears after this list).
arXiv Detail & Related papers (2025-03-14T17:00:22Z) - CSPI-MT: Calibrated Safe Policy Improvement with Multiple Testing for Threshold Policies [30.57323631122579]
We focus on threshold policies, a ubiquitous class of policies with applications in economics, healthcare, and digital advertising.
Existing methods rely on potentially underpowered safety checks and limit the opportunities for finding safe improvements.
We show that in adversarial settings, our approach controls the rate of adopting a policy worse than the baseline to the pre-specified error level.
arXiv Detail & Related papers (2024-08-21T21:38:03Z) - Safe Reinforcement Learning via Confidence-Based Filters [78.39359694273575]
We develop a control-theoretic approach for certifying state safety constraints for nominal policies learned via standard reinforcement learning techniques.
We provide formal safety guarantees, and empirically demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-07-04T11:43:23Z)
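The UWM entry above only names the mechanism; as referenced there, here is a minimal, hypothetical PyTorch sketch of the activation-comparison idea. A toy MLP and random tensors stand in for a real VLM and real safe/unsafe calibration images, and the weight-damping rule at the end is an assumption for illustration, not the paper's exact manipulation.

```python
import torch
import torch.nn as nn

# Toy stand-in for a model; in practice these would be layers inside a VLM.
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 2),
)

# Hypothetical calibration sets: random features standing in for safe / unsafe inputs.
safe_batch = torch.randn(64, 16)
unsafe_batch = torch.randn(64, 16) + 0.5


def mean_abs_activations(net: nn.Module, batch: torch.Tensor) -> dict:
    """Run one batch and record the per-neuron mean |activation| of every Linear layer."""
    stats, handles = {}, []
    for name, module in net.named_modules():
        if isinstance(module, nn.Linear):
            def hook(mod, inp, out, name=name):
                stats[name] = out.detach().abs().mean(dim=0)
            handles.append(module.register_forward_hook(hook))
    with torch.no_grad():
        net(batch)
    for h in handles:
        h.remove()
    return stats


safe_stats = mean_abs_activations(model, safe_batch)
unsafe_stats = mean_abs_activations(model, unsafe_batch)

# Neurons whose activations differ most between unsafe and safe calibration data are
# treated as the ones most involved in processing unsafe content.
gaps = {name: unsafe_stats[name] - safe_stats[name] for name in safe_stats}
for name, gap in gaps.items():
    top = torch.topk(gap, k=min(3, gap.numel()))
    print(f"layer {name}: most unsafe-leaning neurons {top.indices.tolist()}")

# One possible manipulation: damp the incoming weights of the most unsafe-leaning
# neurons in the first layer (illustrative only).
with torch.no_grad():
    first_layer = model[0]
    idx = torch.topk(gaps["0"], k=3).indices
    first_layer.weight[idx] *= 0.5
```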