Related papers: OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning

OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning

URL: http://arxiv.org/abs/2512.02306v1
Date: Tue, 02 Dec 2025 01:01:44 GMT
Title: OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning
Authors: Boyu Zhu, Xiaofei Wen, Wenjie Jacky Mo, Tinghui Zhu, Yanan Xie, Peng Qi, Muhao Chen,
Abstract summary: We propose OmniGuard, a family of omni-modal guardrails that performs safeguarding across all modalities with deliberate reasoning ability.<n>To support the training of OmniGuard, we curate a large, comprehensive omni-modal safety dataset comprising over 210K diverse samples.<n>Experiments on 15 benchmarks show that OmniGuard achieves strong effectiveness and generalization across a wide range of multimodal safety scenarios.
Score: 25.190494543355047
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Omni-modal Large Language Models (OLLMs) that process text, images, videos, and audio introduce new challenges for safety and value guardrails in human-AI interaction. Prior guardrail research largely targets unimodal settings and typically frames safeguarding as binary classification, which limits robustness across diverse modalities and tasks. To address this gap, we propose OmniGuard, the first family of omni-modal guardrails that performs safeguarding across all modalities with deliberate reasoning ability. To support the training of OMNIGUARD, we curate a large, comprehensive omni-modal safety dataset comprising over 210K diverse samples, with inputs that cover all modalities through both unimodal and cross-modal samples. Each sample is annotated with structured safety labels and carefully curated safety critiques from expert models through targeted distillation. Extensive experiments on 15 benchmarks show that OmniGuard achieves strong effectiveness and generalization across a wide range of multimodal safety scenarios. Importantly, OmniGuard provides a unified framework that enforces policies and mitigates risks in omni-modalities, paving the way toward building more robust and capable omnimodal safeguarding systems.

Related papers

Omni-Safety under Cross-Modality Conflict: Vulnerabilities, Dynamics Mechanisms and Efficient Alignment [18.100656799320777]
We investigate a vulnerability in Omni-modal Large Language Models (OLLMs)<n>We propose OmniSteer, which modulates intervention intensity adaptively.<n>Experiments show that our method effectively preserves the general capabilities across all modalities.
arXiv Detail & Related papers (2026-02-10T06:04:08Z)
ProGuard: Towards Proactive Multimodal Safeguard [48.89789547707647]
ProGuard is a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks.<n>We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories.<n>We then train our vision-language base model purely through reinforcement learning to achieve efficient and concise reasoning.
arXiv Detail & Related papers (2025-12-29T16:13:23Z)
AprielGuard [2.3704817495377526]
Existing tools treat safety risks as separate problems, limiting robustness and generalizability.<n>We introduce AprielGuard, an 8B parameter safeguard model that unify these dimensions within a single taxonomy and learning framework.<n> AprielGuard achieves strong performance in detecting harmful content and adversarial manipulations.
arXiv Detail & Related papers (2025-12-23T12:01:32Z)
OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation [94.61617176929384]
OmniSafeBench-MM is a comprehensive toolbox for multi-modal jailbreak attack-defense evaluation.<n>It integrates 13 representative attack methods, 15 defense strategies, and a diverse dataset spanning 9 major risk domains and 50 fine-grained categories.<n>By unifying data, methodology, and evaluation into an open-source, reproducible platform, OmniSafeBench-MM provides a standardized foundation for future research.
arXiv Detail & Related papers (2025-12-06T22:56:29Z)
Qwen3Guard Technical Report [127.69960525219051]
We present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants.<n>Generative Qwen3Guard casts safety classification as an instruction-following task to enable fine-grained tri-class judgments.<n>Stream Qwen3Guard introduces a token-level classification head for real-time safety monitoring.
arXiv Detail & Related papers (2025-10-16T04:00:18Z)
Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems [4.404101728634984]
Protect is a multi-modal guardrailing model designed to operate seamlessly across text, image, and audio inputs.<n>It integrates category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive, multi-modal dataset.<n>Our teacher-assisted annotation pipeline leverages reasoning and explanation traces to generate high-fidelity, context-aware labels.
arXiv Detail & Related papers (2025-10-15T09:40:24Z)
Building a Foundational Guardrail for General Agentic Systems via Synthetic Data [76.18834864749606]
LLM agents can plan multi-step tasks, intervening at the planning stage-before any action is executed-is often the safest way to prevent harm.<n>Existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level.<n>We introduce AuraGen, a controllable engine that synthesizes benign trajectories, injects category-labeled risks with difficulty, and filters outputs via an automated reward model.
arXiv Detail & Related papers (2025-10-10T18:42:32Z)
Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
Automating Steering for Safe Multimodal Large Language Models [58.36932318051907]
We introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model.<n>AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected.
arXiv Detail & Related papers (2025-07-17T16:04:55Z)
GuardSet-X: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset [18.306944278068638]
We introduce GuardSet-X, the first massive multi-domain safety policy-grounded guardrail dataset.<n> GuardSet-X offers broad domain coverage across eight safety-critical domains, such as finance, law, and codeGen.<n>We benchmark 19 advanced guardrail models and uncover a series of findings.
arXiv Detail & Related papers (2025-06-18T01:35:33Z)
HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model [58.12612140992874]
We introduce a holistic safety dataset and benchmark, textbfHoliSafe, that spans all five safe/unsafe image-text combinations.<n>We also propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images.<n> Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks.
arXiv Detail & Related papers (2025-06-05T07:26:34Z)
Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback [34.01716144973483]
Multimodal large language models (MLLMs) are essential for building general-purpose AI assistants.<n>How can we ensure safety alignment of MLLMs to prevent undesired behaviors?<n>In this work, we present the first exploration of the Safe RLHF-V -- the first multimodal safety alignment framework.
arXiv Detail & Related papers (2025-03-22T07:40:20Z)
RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting [7.0595410083835315]
RapGuard is a novel framework that uses multimodal chain-of-thought reasoning to generate scenario-specific safety prompts.<n>RapGuard achieves state-of-the-art safety performance, significantly reducing harmful content without degrading the quality of responses.
arXiv Detail & Related papers (2024-12-25T08:31:53Z)
LlavaGuard: An Open VLM-based Framework for Safeguarding Vision Datasets and Models [26.148022772521493]
LlavaGuard is a suite of VLM-based vision safeguards that address the critical need for reliable guardrails in the era of large-scale data and models.<n>For teaching a VLM safeguard on safety, we create a multimodal safety dataset with high-quality human expert annotations.<n>The resulting LlavaGuard models, ranging from 0.5B to 7B, serve as a versatile tool for evaluating the safety compliance of visual content against flexible policies.
arXiv Detail & Related papers (2024-06-07T17:44:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.