SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanations
- URL: http://arxiv.org/abs/2412.06878v1
- Date: Mon, 09 Dec 2024 18:59:04 GMT
- Title: SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanations
- Authors: Zhaorun Chen, Francesco Pinto, Minzhou Pan, Bo Li,
- Abstract summary: We propose SafeWatch, an efficient MLLM-based video guardrail model to follow customized safety policies. Unlike traditional MLLM-based guardrails that encode all safety policies autoregressively, SafeWatch uniquely encodes each policy chunk in parallel. In addition, SafeWatch incorporates a policy-aware visual token pruning algorithm that adaptively selects the most relevant video tokens for each policy.
- Score: 10.451619858527897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rise of generative AI and the rapid growth of high-quality video generation, video guardrails have become more crucial than ever to ensure safety and security across platforms. Current video guardrails, however, are either overly simplistic, relying on pure classification models trained on simple policies with limited unsafe categories and lacking detailed explanations, or they prompt multimodal large language models (MLLMs) with long safety guidelines, which is inefficient and impractical for guardrailing real-world content. To bridge this gap, we propose SafeWatch, an efficient MLLM-based video guardrail model designed to follow customized safety policies and provide multi-label video guardrail outputs with content-specific explanations in a zero-shot manner. In particular, unlike traditional MLLM-based guardrails that encode all safety policies autoregressively, causing inefficiency and bias, SafeWatch uniquely encodes each policy chunk in parallel and eliminates their position bias, so that all policies are attended to simultaneously with equal importance. In addition, to improve efficiency and accuracy, SafeWatch incorporates a policy-aware visual token pruning algorithm that adaptively selects the most relevant video tokens for each policy, discarding noisy or irrelevant information. This allows for more focused, policy-compliant guardrailing with significantly reduced computational overhead. Considering the limitations of existing video guardrail benchmarks, we propose SafeWatch-Bench, a large-scale video guardrail benchmark comprising over 2M videos spanning six safety categories and covering over 30 tasks, ensuring comprehensive coverage of all potential safety scenarios. SafeWatch outperforms SOTA by 28.2% on SafeWatch-Bench and 13.6% on existing benchmarks, cuts costs by 10%, and delivers top-tier explanations validated by LLM and human reviews.
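The abstract describes two mechanisms: parallel encoding of policy chunks with position-bias removal, and policy-aware visual token pruning. The sketch below illustrates both ideas in PyTorch; the chunk-encoder interface, the cosine-similarity relevance score, and the keep_ratio parameter are illustrative assumptions, not the paper's actual implementation.
```python
# Minimal sketch of the two mechanisms summarized above: (1) each safety-policy
# chunk is encoded independently ("in parallel") so no policy is favored simply
# because it appears earlier in a long concatenated prompt; (2) for each policy,
# only the video tokens most relevant to that policy are kept. The encoder
# interface, the cosine-similarity scoring rule, and keep_ratio are assumptions
# for illustration, not the authors' code.
import torch
import torch.nn.functional as F


def encode_policy_chunks_parallel(encode_chunk, policy_chunks):
    """Encode each policy chunk separately so chunks never attend to each other.

    encode_chunk: a callable mapping a list of token ids to a (d,) embedding
    (hypothetical; a SafeWatch-style model would also reuse the same positional
    indices for every chunk to remove position bias).
    Returns a (num_policies, d) tensor of policy embeddings.
    """
    return torch.stack([encode_chunk(chunk) for chunk in policy_chunks])


def prune_video_tokens(video_tokens, policy_embedding, keep_ratio=0.25):
    """Keep the video tokens most relevant to a single policy.

    video_tokens: (num_tokens, d); policy_embedding: (d,).
    Relevance is scored here with cosine similarity (an assumption), and the
    top keep_ratio fraction of tokens is retained for that policy.
    """
    scores = F.cosine_similarity(video_tokens, policy_embedding.unsqueeze(0), dim=-1)
    k = max(1, int(keep_ratio * video_tokens.size(0)))
    top = torch.topk(scores, k)
    return video_tokens[top.indices], top.indices


if __name__ == "__main__":
    d, num_tokens = 64, 512
    # Toy stand-ins: random "video tokens" and a placeholder chunk encoder.
    video_tokens = torch.randn(num_tokens, d)
    encode_chunk = lambda ids: torch.randn(d)  # placeholder text encoder
    policies = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # token ids per policy chunk
    policy_embs = encode_policy_chunks_parallel(encode_chunk, policies)
    kept, idx = prune_video_tokens(video_tokens, policy_embs[0], keep_ratio=0.1)
    print(kept.shape)  # torch.Size([51, 64])
```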
Related papers
- iSafetyBench: A video-language benchmark for safety in industrial environment [6.697702130929693]
iSafetyBench is a new video-language benchmark designed to evaluate model performance in industrial environments. iSafetyBench comprises 1,100 video clips sourced from real-world industrial settings. We evaluate eight state-of-the-art video-language models under zero-shot conditions.
arXiv Detail & Related papers (2025-08-01T07:55:53Z) - Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs). SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z) - HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model [52.72318433518926]
Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content. We introduce a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations. We propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head.
arXiv Detail & Related papers (2025-06-05T07:26:34Z) - From Evaluation to Defense: Advancing Safety in Video Large Language Models [33.10355085086974]
We introduce VideoSafetyBench (VSB-77k), the first large-scale, culturally diverse benchmark for Video LLM safety. Integrating video modality degrades safety performance by an average of 42.3%, exposing systemic risks in multimodal attack exploitation. We propose VideoSafety-R1, a dual-stage framework achieving unprecedented safety gains through two innovations.
arXiv Detail & Related papers (2025-05-22T13:16:53Z) - SafeVid: Toward Safety Aligned Video Large Multimodal Models [60.14535756294228]
We introduce SafeVid, a framework designed to instill video-specific safety principles in Video Large Multimodal Models (VLMMs). SafeVid employs detailed textual video descriptions as an interpretive bridge, facilitating rule-driven safety reasoning. Alignment with SafeVid-350K significantly enhances VLMM safety, with models like LLaVA-NeXT-Video demonstrating substantial improvements.
arXiv Detail & Related papers (2025-05-17T09:21:33Z) - Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs [51.90597846977058]
Video-SafetyBench is the first benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images and motion text.
arXiv Detail & Related papers (2025-05-17T05:06:38Z) - One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models [20.42976162135529]
Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research. We propose D-STT, a simple yet effective defense algorithm that identifies and explicitly decodes safety trigger tokens of the given safety-aligned LLM.
arXiv Detail & Related papers (2025-05-12T01:26:50Z) - Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models [34.66687625996389]
Multimodal large language models (MLLMs) are critical for developing general-purpose AI assistants, yet they face growing safety risks.
How can we ensure that MLLMs are safely aligned to prevent undesired behaviors such as discrimination, misinformation, or violations of ethical standards?
We propose Safe RLHF-V, the first multimodal safety alignment framework that jointly optimizes helpfulness and safety.
arXiv Detail & Related papers (2025-03-22T07:40:20Z) - Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs? [0.836362570897926]
We investigate existing methods for such generalization and find them insufficient.
To avoid performance degradation and preserve safe performance, we advocate for a two-step framework.
We find that the final hidden state for the last token is enough to provide robust performance.
arXiv Detail & Related papers (2025-02-22T10:31:50Z) - Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs [56.440345471966666]
Multimodal Large Language Models (MLLMs) have expanded the capabilities of traditional language models by enabling interaction through both text and images.
This paper introduces MMSafeAware, the first comprehensive multimodal safety awareness benchmark designed to evaluate MLLMs across 29 safety scenarios.
MMSafeAware includes both unsafe and over-safety subsets to assess models' abilities to correctly identify unsafe content and avoid over-sensitivity that can hinder helpfulness.
arXiv Detail & Related papers (2025-02-16T16:12:40Z) - Internal Activation as the Polar Star for Steering Unsafe LLM Behavior [50.463399903987245]
We introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states.
Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility.
arXiv Detail & Related papers (2025-02-03T04:23:33Z) - MLLM-as-a-Judge for Image Safety without Human Labeling [81.24707039432292]
In the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content.
It is crucial to identify such unsafe images based on established safety rules.
Existing approaches typically fine-tune MLLMs with human-labeled datasets.
arXiv Detail & Related papers (2024-12-31T00:06:04Z) - RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting [7.0595410083835315]
RapGuard is a novel framework that uses multimodal chain-of-thought reasoning to generate scenario-specific safety prompts.
RapGuard achieves state-of-the-art safety performance, significantly reducing harmful content without degrading the quality of responses.
arXiv Detail & Related papers (2024-12-25T08:31:53Z) - SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types [21.683010095703832]
We develop a novel benchmark to assess the generalization of large language model (LLM) safety across various tasks and prompt types.
This benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak attacks on LLM safety.
Our assessment reveals that most LLMs perform worse on discriminative tasks than generative ones, and are highly susceptible to prompts, indicating poor generalization in safety alignment.
arXiv Detail & Related papers (2024-10-29T11:47:01Z) - ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time [12.160713548659457]
Adversarial visual inputs can easily bypass VLM defense mechanisms.
We propose a novel two-phase inference-time alignment framework, evaluating input visual contents and output responses.
Experiments show that ETA outperforms baseline methods in terms of harmlessness, helpfulness, and efficiency.
arXiv Detail & Related papers (2024-10-09T07:21:43Z) - CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.36429361299807]
Multimodal large language models (MLLMs) have demonstrated remarkable success in engaging in conversations involving visual inputs.
The integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs.
We introduce a technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution.
arXiv Detail & Related papers (2024-09-17T17:14:41Z) - ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2.
These models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z) - PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing [1.474945380093949]
Inference-Time Guardrails (ITG) offer solutions that shift model output distributions towards compliance.
Current methods struggle in balancing safety with helpfulness.
We propose PrimeGuard, a novel ITG method that utilizes structured control flow.
arXiv Detail & Related papers (2024-07-23T09:14:27Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response.
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models [39.15695612766001]
We introduce T2VSafetyBench, a new benchmark for safety-critical assessments of text-to-video models.
We define 12 critical aspects of video generation safety and construct a malicious prompt dataset.
No single model excels in all aspects, with different models showing various strengths.
There is a trade-off between the usability and safety of text-to-video generative models.
arXiv Detail & Related papers (2024-07-08T14:04:58Z) - Towards Comprehensive and Efficient Post Safety Alignment of Large Language Models via Safety Patching [77.36097118561057]
SafePatching is a novel framework for comprehensive and efficient post safety alignment (PSA).
SafePatching achieves a more comprehensive and efficient PSA than baseline methods.
arXiv Detail & Related papers (2024-05-22T16:51:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.