Related papers: SGuard-v1: Safety Guardrail for Large Language Models

SGuard-v1: Safety Guardrail for Large Language Models

URL: http://arxiv.org/abs/2511.12497v1
Date: Sun, 16 Nov 2025 08:15:54 GMT
Title: SGuard-v1: Safety Guardrail for Large Language Models
Authors: JoonHo Lee, HyeonMin Cho, Jaewoong Yun, Hyunjae Lee, JunKyu Lee, Juree Seok,
Abstract summary: SGuard-v1 is a lightweight safety guardrail for Large Language Models (LLMs)<n>It comprises two specialized models to detect harmful content and screen adversarial prompts in human-AI conversational settings.
Score: 9.229602223310485
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present SGuard-v1, a lightweight safety guardrail for Large Language Models (LLMs), which comprises two specialized models to detect harmful content and screen adversarial prompts in human-AI conversational settings. The first component, ContentFilter, is trained to identify safety risks in LLM prompts and responses in accordance with the MLCommons hazard taxonomy, a comprehensive framework for trust and safety assessment of AI. The second component, JailbreakFilter, is trained with a carefully designed curriculum over integrated datasets and findings from prior work on adversarial prompting, covering 60 major attack types while mitigating false-unsafe classification. SGuard-v1 is built on the 2B-parameter Granite-3.3-2B-Instruct model that supports 12 languages. We curate approximately 1.4 million training instances from both collected and synthesized data and perform instruction tuning on the base model, distributing the curated data across the two component according to their designated functions. Through extensive evaluation on public and proprietary safety benchmarks, SGuard-v1 achieves state-of-the-art safety performance while remaining lightweight, thereby reducing deployment overhead. SGuard-v1 also improves interpretability for downstream use by providing multi-class safety predictions and their binary confidence scores. We release the SGuard-v1 under the Apache-2.0 License to enable further research and practical deployment in AI safety.

Related papers

Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models [50.91504059485288]
We propose a framework that identifies safety-critical attention heads through global optimization over all heads simultaneously.<n>We develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching.
arXiv Detail & Related papers (2026-01-22T09:32:43Z)
Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis [48.70474961584997]
Indirect prompt injection attacks (IPIAs) pose a critical threat to large language models (LLMs)<n>We present IntentGuard, a general defense framework based on instruction-following intent analysis.
arXiv Detail & Related papers (2025-11-30T16:29:04Z)
Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks [0.31984926651189866]
Sentra-Guard is a real-time modular defense system for large language models (LLMs)<n>The framework uses a hybrid architecture with FAISS-indexed SBERT embedding representations that capture the semantic meaning of prompts.<n>It identifies adversarial prompts in both direct and obfuscated attack vectors.
arXiv Detail & Related papers (2025-10-26T11:19:47Z)
OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models [3.3252656373741547]
We present OpenGuardrails, the first fully open-source platform that unifies large-model-based safety detection, manipulation defense, and deployable guardrail infrastructure.<n>OpenGuardrails protects against three major classes of risks: (1) content-safety violations such as harmful or explicit text generation, (2) model-manipulation attacks including prompt injection, jailbreaks, and code-interpreter abuse, and (3) data leakage involving sensitive or private information.
arXiv Detail & Related papers (2025-10-22T02:02:27Z)
Qwen3Guard Technical Report [127.69960525219051]
We present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants.<n>Generative Qwen3Guard casts safety classification as an instruction-following task to enable fine-grained tri-class judgments.<n>Stream Qwen3Guard introduces a token-level classification head for real-time safety monitoring.
arXiv Detail & Related papers (2025-10-16T04:00:18Z)
RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts [39.58550043591753]
External LLM-based guardrail models have emerged as a popular solution to screen unsafe inputs and outputs.<n>We investigated how robust LLM-based guardrails are against additional information embedded in the context.
arXiv Detail & Related papers (2025-10-06T19:20:43Z)
Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model [58.12612140992874]
We introduce a holistic safety dataset and benchmark, textbfHoliSafe, that spans all five safe/unsafe image-text combinations.<n>We also propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images.<n> Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks.
arXiv Detail & Related papers (2025-06-05T07:26:34Z)
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [54.10865585773691]
We introduce WildGuard -- an open, light-weight moderation tool for LLM safety.<n>WildGuard achieves three goals: identifying malicious intent in user prompts, detecting safety risks of model responses, and determining model refusal rate.
arXiv Detail & Related papers (2024-06-26T16:58:20Z)
LlavaGuard: An Open VLM-based Framework for Safeguarding Vision Datasets and Models [26.148022772521493]
LlavaGuard is a suite of VLM-based vision safeguards that address the critical need for reliable guardrails in the era of large-scale data and models.<n>For teaching a VLM safeguard on safety, we create a multimodal safety dataset with high-quality human expert annotations.<n>The resulting LlavaGuard models, ranging from 0.5B to 7B, serve as a versatile tool for evaluating the safety compliance of visual content against flexible policies.
arXiv Detail & Related papers (2024-06-07T17:44:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.