GuardSet-X: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset
- URL: http://arxiv.org/abs/2506.19054v2
- Date: Thu, 26 Jun 2025 03:18:08 GMT
- Title: GuardSet-X: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset
- Authors: Mintong Kang, Zhaorun Chen, Chejian Xu, Jiawei Zhang, Chengquan Guo, Minzhou Pan, Ivan Revilla, Yu Sun, Bo Li
- Abstract summary: We introduce GuardSet-X, the first massive multi-domain safety policy-grounded guardrail dataset. GuardSet-X offers broad domain coverage across eight safety-critical domains, such as finance, law, and codeGen. We benchmark 19 advanced guardrail models and uncover a series of findings.
- Score: 18.306944278068638
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As LLMs become widespread across diverse applications, concerns about the security and safety of LLM interactions have intensified. Numerous guardrail models and benchmarks have been developed to ensure LLM content safety. However, existing guardrail benchmarks are often built upon ad hoc risk taxonomies that lack a principled grounding in standardized safety policies, limiting their alignment with real-world operational requirements. Moreover, they tend to overlook domain-specific risks, while the same risk category can carry different implications across different domains. To bridge these gaps, we introduce GuardSet-X, the first massive multi-domain safety policy-grounded guardrail dataset. GuardSet-X offers: (1) broad domain coverage across eight safety-critical domains, such as finance, law, and codeGen; (2) policy-grounded risk construction based on authentic, domain-specific safety guidelines; (3) diverse interaction formats, encompassing declarative statements, questions, instructions, and multi-turn conversations; (4) advanced benign data curation via detoxification prompting to challenge over-refusal behaviors; and (5) attack-enhanced instances that simulate adversarial inputs designed to bypass guardrails. Based on GuardSet-X, we benchmark 19 advanced guardrail models and uncover a series of findings, such as: (1) all models achieve varied F1 scores, with many demonstrating high variance across risk categories, highlighting their limited domain coverage and insufficient handling of domain-specific safety concerns; (2) as models evolve, their coverage of safety risks broadens, but performance on common risk categories may decrease; (3) all models remain vulnerable to optimized adversarial attacks. We believe that GuardSet-X and the unique insights derived from our evaluations will advance the development of policy-aligned and resilient guardrail systems.
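The benchmark's headline finding is that guardrail F1 varies sharply across risk categories. A minimal sketch of that per-category evaluation, assuming an illustrative record layout (domain, risk_category, label, pred) rather than the dataset's released schema:

```python
from collections import defaultdict

# Illustrative GuardSet-X-style records; the field names are assumptions for
# this sketch, not the dataset's actual schema.
records = [
    {"domain": "finance", "risk_category": "fraud", "label": 1, "pred": 1},
    {"domain": "law", "risk_category": "unauthorized_legal_advice", "label": 1, "pred": 0},
    {"domain": "codeGen", "risk_category": "malware", "label": 0, "pred": 0},
    {"domain": "finance", "risk_category": "fraud", "label": 0, "pred": 1},
]

def per_category_f1(items):
    """Compute F1 per risk category; high variance across categories is the
    kind of gap the benchmark's findings point to."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for r in items:
        c = counts[r["risk_category"]]
        if r["pred"] == 1 and r["label"] == 1:
            c["tp"] += 1
        elif r["pred"] == 1 and r["label"] == 0:
            c["fp"] += 1
        elif r["pred"] == 0 and r["label"] == 1:
            c["fn"] += 1
    f1 = {}
    for cat, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        f1[cat] = 2 * c["tp"] / denom if denom else 0.0
    return f1

print(per_category_f1(records))
```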
Related papers
- ExpGuard: LLM Content Moderation in Specialized Domains [46.00867862478331]
Current guardrail models predominantly address general human-LLM interactions. We introduce ExpGuard, a robust guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. We present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts paired with corresponding refusal and compliant responses.
arXiv Detail & Related papers (2026-03-03T04:09:49Z) - Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought [5.251527748612469]
Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies. We present PACT (Prompt-Thought Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning.
arXiv Detail & Related papers (2026-02-06T12:20:01Z) - Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models [50.91504059485288]
We propose a framework that identifies safety-critical attention heads through global optimization over all heads simultaneously. We develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching.
arXiv Detail & Related papers (2026-01-22T09:32:43Z) - YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models [36.084240131323824]
We present YuFeng-XGuard, a reasoning-centric guardrail model family for large language models (LLMs). Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and confidence scores. We introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining.
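To make the decoupling concrete, here is a minimal sketch of how a dynamic policy layer might consume such structured risk predictions; the class and field names are assumptions for illustration, not YuFeng-XGuard's actual interface:

```python
from dataclasses import dataclass

# Hypothetical structures for the sketch; the real output format and policy
# syntax are not specified in the summary above.
@dataclass
class RiskPrediction:
    category: str       # e.g. "self_harm", "financial_fraud"
    confidence: float   # model's confidence that the risk is present

def enforce(predictions: list[RiskPrediction], policy: dict[str, float]) -> str:
    """Risk perception (predictions) is separated from enforcement (policy):
    tightening or loosening per-category thresholds needs no retraining."""
    for p in predictions:
        threshold = policy.get(p.category)
        if threshold is not None and p.confidence >= threshold:
            return f"block:{p.category}"
    return "allow"

# Usage: the same predictions yield different decisions under different policies.
preds = [RiskPrediction("financial_fraud", 0.62)]
print(enforce(preds, {"financial_fraud": 0.5}))   # block:financial_fraud
print(enforce(preds, {"financial_fraud": 0.8}))   # allow
```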
arXiv Detail & Related papers (2026-01-22T02:23:18Z) - DeepKnown-Guard: A Proprietary Model-Based Safety Response Framework for AI Agents [12.054307827384415]
Safety risks of Large Language Models (LLMs) have become increasingly prominent, severely constraining their trustworthy deployment in critical domains. This paper proposes a novel safety response framework designed to safeguard LLMs at both the input and output levels.
arXiv Detail & Related papers (2025-11-05T03:04:35Z) - Qwen3Guard Technical Report [127.69960525219051]
We present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants. Generative Qwen3Guard casts safety classification as an instruction-following task to enable fine-grained tri-class judgments. Stream Qwen3Guard introduces a token-level classification head for real-time safety monitoring.
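The token-level "Stream" variant enables moderation while a response is still being generated. A rough sketch of that pattern, with a stand-in classifier in place of the real model:

```python
from typing import Iterable

# Minimal sketch of streaming, token-level safety moderation in the spirit of
# the "Stream" variant described above; the classifier is a stand-in, not
# Qwen3Guard's actual API.
def hypothetical_token_risk(prefix: str, token: str) -> str:
    """Stand-in per-token head returning a coarse verdict for the sketch."""
    return "unsafe" if "bomb" in token.lower() else "safe"

def stream_with_guard(tokens: Iterable[str]) -> str:
    """Emit tokens one by one and stop generation as soon as a token is flagged,
    instead of waiting for the full response before moderating."""
    output = []
    for tok in tokens:
        verdict = hypothetical_token_risk("".join(output), tok)
        if verdict == "unsafe":
            return "".join(output) + " [generation stopped by guardrail]"
        output.append(tok)
    return "".join(output)

print(stream_with_guard(["Here ", "is ", "how ", "to ", "build ", "a ", "bomb"]))
```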
arXiv Detail & Related papers (2025-10-16T04:00:18Z) - Building a Foundational Guardrail for General Agentic Systems via Synthetic Data [76.18834864749606]
Because LLM agents can plan multi-step tasks, intervening at the planning stage, before any action is executed, is often the safest way to prevent harm. Existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. We introduce AuraGen, a controllable engine that synthesizes benign trajectories, injects category-labeled risks with calibrated difficulty, and filters outputs via an automated reward model.
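The three-stage data engine (synthesize benign trajectories, inject labeled risks, filter with a reward model) can be pictured roughly as follows; every function body is a placeholder rather than one of AuraGen's actual components:

```python
import random

def synthesize_benign_trajectory() -> list[str]:
    """Placeholder benign agent trajectory (a list of planned steps)."""
    return ["open_ticket", "look_up_account", "send_summary_email"]

def inject_risk(trajectory: list[str], category: str) -> dict:
    """Insert a risky step and keep the category label for supervision."""
    risky = trajectory.copy()
    risky.insert(random.randrange(len(risky)), f"risky_step::{category}")
    return {"trajectory": risky, "risk_category": category}

def reward_model_score(sample: dict) -> float:
    """Placeholder automated reward model used to filter low-quality samples."""
    return random.uniform(0.0, 1.0)

dataset = []
for category in ["data_exfiltration", "unauthorized_purchase"]:
    sample = inject_risk(synthesize_benign_trajectory(), category)
    if reward_model_score(sample) > 0.5:   # keep only samples the filter accepts
        dataset.append(sample)
print(dataset)
```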
arXiv Detail & Related papers (2025-10-10T18:42:32Z) - GSPR: Aligning LLM Safeguards as Generalizable Safety Policy Reasoners [60.49708196646694]
Large language models (LLMs) are increasingly integrated into numerous applications across various domains. In this paper, we propose GSPR, a Generalizable Safety Policy Reasoner that identifies unsafe input prompts and LLM outputs that violate safety policies. Our GSPR significantly improves existing safety guardrails' reasoning capabilities for both safety and category prediction tasks.
arXiv Detail & Related papers (2025-09-29T08:07:45Z) - Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs). SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). We show that SecTOW significantly improves security while preserving general performance.
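A schematic outer loop for this kind of defender/attacker co-training is sketched below; the training, attack-generation, and evaluation functions are placeholders and do not reproduce SecTOW's actual GRPO recipe:

```python
def generate_attacks(attacker, defender, n: int = 8) -> list[str]:
    """Placeholder: attacker searches for prompts that bypass the defender."""
    return [f"adversarial_prompt_{i}" for i in range(n)]

def train_with_grpo(model, data, reward_fn):
    """Placeholder for a policy-optimization (GRPO) update step."""
    return model

def evaluate(defender) -> float:
    """Placeholder attack-success-rate metric."""
    return 0.0

defender, attacker = "defender_policy", "attacker_policy"
for round_idx in range(3):
    # 1) attacker is rewarded for prompts that currently slip past the defender
    attacks = generate_attacks(attacker, defender)
    attacker = train_with_grpo(attacker, attacks, reward_fn=lambda x: 1.0)
    # 2) defender is updated to reject the newly discovered attacks
    defender = train_with_grpo(defender, attacks, reward_fn=lambda x: 1.0)
    print(f"round {round_idx}: attack success rate = {evaluate(defender):.2f}")
```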
arXiv Detail & Related papers (2025-07-29T17:39:48Z) - MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models [17.824240702928133]
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. Existing safety alignment approaches fall short in addressing the complex and nuanced threats posed by multimodal inputs. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities.
arXiv Detail & Related papers (2025-06-24T02:37:59Z) - Exploring the Secondary Risks of Large Language Models [17.845215420030467]
We introduce secondary risks, marked by harmful or misleading behaviors arising from benign prompts. Unlike adversarial attacks, these risks stem from imperfect generalization and often evade standard safety mechanisms. We propose SecLens, a black-box, multi-objective search framework that efficiently elicits secondary risk behaviors.
arXiv Detail & Related papers (2025-06-14T07:31:52Z) - Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models [83.80177564873094]
We propose a unified multimodal universal jailbreak attack framework. We evaluate the undesirable context generation of MLLMs like LLaVA, Yi-VL, MiniGPT4, MiniGPT-v2, and InstructBLIP. This study underscores the urgent need for robust safety measures in MLLMs.
arXiv Detail & Related papers (2025-06-02T04:33:56Z) - Seven Security Challenges That Must be Solved in Cross-domain Multi-agent LLM Systems [16.838103835766066]
Large language models (LLMs) are rapidly evolving into autonomous agents that cooperate across organizational boundaries. This position paper maps the security agenda for cross-domain multi-agent LLM systems.
arXiv Detail & Related papers (2025-05-28T18:19:03Z) - Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios [77.86600052899156]
Large Language Model (LLM)-based agents are increasingly deployed in real-world applications. We propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation. We show that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks.
arXiv Detail & Related papers (2025-05-23T10:56:06Z) - Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, which robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
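One way to read "fine-grained safety signals" is as per-token loss weights during finetuning. A toy sketch under that assumption (the actual STAR scoring and shaping rule are not reproduced here):

```python
import torch
import torch.nn.functional as F

# Toy illustration of segment-weighted finetuning loss in the spirit of safety
# shaping: safe tokens get full weight, unsafe tokens are suppressed. The
# weighting scheme and scores here are invented for the sketch.
vocab, seq_len = 100, 6
logits = torch.randn(seq_len, vocab, requires_grad=True)   # model outputs
labels = torch.randint(0, vocab, (seq_len,))               # target tokens

# Per-token safety signal in [0, 1]: 1.0 = safe segment, 0.0 = unsafe segment.
safety = torch.tensor([1.0, 1.0, 1.0, 0.2, 0.0, 0.0])

per_token_loss = F.cross_entropy(logits, labels, reduction="none")
shaped_loss = (safety * per_token_loss).sum() / safety.sum().clamp(min=1e-6)
shaped_loss.backward()   # gradients emphasize safe segments, damp unsafe ones
print(float(shaped_loss))
```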
arXiv Detail & Related papers (2025-05-22T18:05:16Z) - Safety Alignment Can Be Not Superficial With Explicit Safety Signals [8.297367440457508]
Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially. This paper identifies a fundamental cause of this superficiality: existing alignment approaches presume that models can implicitly learn a safety-related reasoning task during the alignment process. By explicitly introducing a safety-related binary classification task and integrating its signals with our attention and decoding strategies, we eliminate this ambiguity.
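A rough sketch of using an explicit binary safety signal at decoding time follows; the classifier and the simple gating rule are illustrative assumptions, not the paper's attention and decoding integration:

```python
import random

def hypothetical_safety_prob(prompt: str, partial_response: str) -> float:
    """Stand-in binary classifier: probability that continuing is unsafe."""
    return 0.9 if "exploit" in prompt.lower() else 0.05

def sample_next_token(prompt: str, partial_response: str) -> str:
    """Placeholder sampler standing in for the language model."""
    return random.choice(["Sure", ",", " here", " is", " an", " answer"])

def decode(prompt: str, max_tokens: int = 6, threshold: float = 0.5) -> str:
    response = ""
    for _ in range(max_tokens):
        # The explicit safety signal overrides normal decoding when triggered.
        if hypothetical_safety_prob(prompt, response) > threshold:
            return "I can't help with that."
        response += sample_next_token(prompt, response)
    return response

print(decode("Write a working exploit for this service"))
print(decode("Summarize today's weather"))
```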
arXiv Detail & Related papers (2025-05-19T20:40:46Z) - SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose SafeBench, a comprehensive framework designed for conducting safety evaluations of MLLMs.
Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol.
Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z) - Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model [73.8765529028288]
We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. To empirically investigate this problem, we developed SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
arXiv Detail & Related papers (2024-06-21T16:14:15Z) - AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts [0.0]
As Large Language Models (LLMs) and generative AI become more widespread, the content safety risks associated with their use also increase.
We find a notable deficiency in high-quality content safety datasets and benchmarks that comprehensively cover a wide range of critical safety areas.
To address this, we define a broad content safety risk taxonomy, comprising 13 critical risk categories and 9 sparse risk categories.
arXiv Detail & Related papers (2024-04-09T03:54:28Z) - Uniformly Safe RL with Objective Suppression for Multi-Constraint Safety-Critical Applications [73.58451824894568]
The widely adopted CMDP model constrains the risks in expectation, which leaves room for dangerous behaviors in long-tail states.
In safety-critical domains, such behaviors could lead to disastrous outcomes.
We propose Objective Suppression, a novel method that adaptively suppresses the task-reward-maximizing objective according to a safety critic.
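A toy version of the suppression idea, with an invented suppression rule standing in for the method's actual formulation:

```python
def safety_critic(state: str) -> float:
    """Stand-in critic returning estimated constraint violation in [0, 1]."""
    return 0.8 if state == "long_tail_state" else 0.1

def shaped_objective(task_reward: float, state: str, risk_tolerance: float = 0.3) -> float:
    """Down-weight the task-reward term as estimated risk exceeds the tolerance."""
    risk = safety_critic(state)
    suppression = max(0.0, 1.0 - max(0.0, risk - risk_tolerance) / (1.0 - risk_tolerance))
    return suppression * task_reward

print(shaped_objective(1.0, "nominal_state"))     # ~1.0: task reward kept
print(shaped_objective(1.0, "long_tail_state"))   # heavily suppressed
```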
arXiv Detail & Related papers (2024-02-23T23:22:06Z) - A Security Risk Taxonomy for Prompt-Based Interaction With Large Language Models [5.077431021127288]
This paper addresses a gap in current research by focusing on security risks posed by large language models (LLMs).
Our work proposes a taxonomy of security risks along the user-model communication pipeline and categorizes the attacks by target and attack type alongside the commonly used confidentiality, integrity, and availability (CIA) triad.
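A minimal data structure mirroring the taxonomy axes named above (target, attack type, CIA property) is sketched below; the example entry is illustrative and not taken from the paper's tables:

```python
from dataclasses import dataclass
from enum import Enum

class CIA(Enum):
    CONFIDENTIALITY = "confidentiality"
    INTEGRITY = "integrity"
    AVAILABILITY = "availability"

@dataclass
class RiskEntry:
    name: str
    target: str        # e.g. "user", "model", "third party"
    attack_type: str   # e.g. "prompt injection", "data extraction"
    cia: CIA

# Hypothetical entry for illustration only.
entry = RiskEntry(
    name="system prompt leakage",
    target="model",
    attack_type="prompt injection",
    cia=CIA.CONFIDENTIALITY,
)
print(entry)
```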
arXiv Detail & Related papers (2023-11-19T20:22:05Z) - Safe Model-Based Multi-Agent Mean-Field Reinforcement Learning [48.667697255912614]
Mean-field reinforcement learning studies the policy of a representative agent interacting with an infinite population of identical agents.
We propose Safe-M³-UCRL, the first model-based mean-field reinforcement learning algorithm that attains safe policies even in the case of unknown transitions.
Our algorithm effectively meets the demand in critical areas while ensuring service accessibility in regions with low demand.
arXiv Detail & Related papers (2023-06-29T15:57:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.