GSPR: Aligning LLM Safeguards as Generalizable Safety Policy Reasoners
- URL: http://arxiv.org/abs/2509.24418v1
- Date: Mon, 29 Sep 2025 08:07:45 GMT
- Title: GSPR: Aligning LLM Safeguards as Generalizable Safety Policy Reasoners
- Authors: Haoran Li, Yulin Chen, Jingru Zeng, Hao Peng, Huihao Jing, Wenbin Hu, Xi Yang, Ziqian Zeng, Sirui Han, Yangqiu Song
- Abstract summary: Large language models (LLMs) are increasingly integrated into numerous applications across various domains. In this paper, we propose GSPR, a Generalizable Safety Policy Reasoner to identify unsafe input prompts and LLMs' outputs with violated safety taxonomies. Our GSPR significantly improves existing safety guardrails' reasoning capabilities for both safety and category prediction tasks.
- Score: 60.49708196646694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) are increasingly integrated into numerous applications across various domains, LLMs' safety becomes a critical concern for both application developers and intended users. Currently, great efforts have been made to develop safety benchmarks with fine-grained taxonomies. However, these benchmarks' taxonomies are disparate with different safety policies. Thus, existing safeguards trained on these benchmarks are either coarse-grained to only distinguish between safe and unsafe, or constrained by the narrow risk taxonomies of a single benchmark. To leverage these fine-grained safety taxonomies across multiple safety benchmarks, in this paper, we propose GSPR, a Generalizable Safety Policy Reasoner to identify unsafe input prompts and LLMs' outputs with violated safety taxonomies through Group Relative Policy Optimization (GRPO). Unlike prior safeguards which only cover a fixed set of risk factors, our GSPR incentivizes its reasoning capability with varied safety taxonomies through our careful cold-start strategy and reward design. Consequently, our GSPR can be trained across multiple safety benchmarks with distinct taxonomies and naturally exhibits powerful generalization ability. We conduct extensive experiments to show that our GSPR significantly improves existing safety guardrails' reasoning capabilities for both safety and category prediction tasks. Moreover, our GSPR not only demonstrates powerful safety generalization abilities but also achieves the least inference token costs with explanations.
Related papers
- Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety [59.01189713115365]
We evaluate the impact of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases. We find that referencing explicit codes inconsistently improves harmlessness and systematically degrades helpfulness. We propose CADA, a case-augmented deliberative alignment method for LLMs utilizing reinforcement learning on self-generated safety reasoning chains.
arXiv Detail & Related papers (2026-01-12T21:08:46Z)
- SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization [79.14563283347773]
Multimodal large language models (MLLMs) have demonstrated impressive reasoning and instruction-following capabilities. Cross-modal couplings can produce unsafe semantics even when individual inputs are benign. We propose SafeGRPO, a self-rewarded multimodal safety alignment framework.
arXiv Detail & Related papers (2025-11-17T05:09:49Z)
- DeepKnown-Guard: A Proprietary Model-Based Safety Response Framework for AI Agents [12.054307827384415]
Large Language Models (LLMs) have become increasingly prominent, severely constraining their trustworthy deployment in critical domains. This paper proposes a novel safety response framework designed to safeguard LLMs at both the input and output levels.
arXiv Detail & Related papers (2025-11-05T03:04:35Z)
- Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention [53.25106308403173]
We show that existing methods overlook the unique significance of safe reasoning, undermining their trustworthiness and posing potential risks in applications if unsafe reasoning is accessible to and exploited by malicious users. We propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing pairs for preference learning with strong signals.
arXiv Detail & Related papers (2025-09-29T07:41:09Z)
- Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework [31.278770676774325]
We propose Safe-SAIL, a framework for interpreting SAE features within large language models (LLMs). Our approach systematically identifies SAE with best concept-specific interpretability, explains safety-related neurons, and introduces efficient strategies to scale up the interpretation process.
arXiv Detail & Related papers (2025-09-11T11:22:43Z)
- SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning [76.56522719330911]
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering. LRMs pose great safety risks against harmful queries and adversarial attacks. We propose SafeKey to better activate the safety aha moment in the key sentence.
arXiv Detail & Related papers (2025-05-22T03:46:03Z)
- ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
- The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.