OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models
- URL: http://arxiv.org/abs/2510.19169v2
- Date: Wed, 29 Oct 2025 03:17:43 GMT
- Title: OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models
- Authors: Thomas Wang, Haowen Li
- Abstract summary: We present OpenGuardrails, the first fully open-source platform that unifies large-model-based safety detection, manipulation defense, and deployable guardrail infrastructure. OpenGuardrails protects against three major classes of risks: (1) content-safety violations such as harmful or explicit text generation, (2) model-manipulation attacks including prompt injection, jailbreaks, and code-interpreter abuse, and (3) data leakage involving sensitive or private information.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) are increasingly integrated into real-world applications, ensuring their safety, robustness, and privacy compliance has become critical. We present OpenGuardrails, the first fully open-source platform that unifies large-model-based safety detection, manipulation defense, and deployable guardrail infrastructure. OpenGuardrails protects against three major classes of risks: (1) content-safety violations such as harmful or explicit text generation, (2) model-manipulation attacks including prompt injection, jailbreaks, and code-interpreter abuse, and (3) data leakage involving sensitive or private information. Unlike prior modular or rule-based frameworks, OpenGuardrails introduces three core innovations: (1) a Configurable Policy Adaptation mechanism that allows per-request customization of unsafe categories and sensitivity thresholds; (2) a Unified LLM-based Guard Architecture that performs both content-safety and manipulation detection within a single model; and (3) a Quantized, Scalable Model Design that compresses a 14B dense base model to 3.3B via GPTQ while preserving over 98% of benchmark accuracy. The system supports 119 languages, achieves state-of-the-art performance across multilingual safety benchmarks, and can be deployed as a secure gateway or API-based service for enterprise use. All models, datasets, and deployment scripts are released under the Apache 2.0 license.
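The per-request Configurable Policy Adaptation idea can be sketched in a few lines. This is an illustrative toy, not the actual OpenGuardrails API: the `GuardPolicy` fields, category names, and the `is_blocked` helper are all hypothetical, standing in for whatever schema the released platform uses.

```python
# Hypothetical sketch of per-request policy adaptation, in the spirit of
# OpenGuardrails' Configurable Policy Adaptation. Field names and category
# labels are illustrative, not the project's actual schema.
from dataclasses import dataclass, field

@dataclass
class GuardPolicy:
    # Unsafe categories this particular request should screen for.
    categories: set = field(default_factory=lambda: {"violence", "self_harm"})
    # Per-category sensitivity threshold in [0, 1]; lower means stricter.
    thresholds: dict = field(default_factory=dict)
    default_threshold: float = 0.7

def is_blocked(policy: GuardPolicy, scores: dict) -> bool:
    """Block when any enabled category's risk score reaches its threshold."""
    for cat in policy.categories:
        limit = policy.thresholds.get(cat, policy.default_threshold)
        if scores.get(cat, 0.0) >= limit:
            return True
    return False

# One caller tightens "data_leak" while leaving other categories at the default.
policy = GuardPolicy(categories={"violence", "data_leak"},
                     thresholds={"data_leak": 0.4})
print(is_blocked(policy, {"violence": 0.3, "data_leak": 0.45}))  # True
```

The point of the mechanism is that the category set and thresholds travel with each request, so two tenants calling the same guard model can enforce different policies without redeploying it.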
Related papers
- OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation [94.61617176929384]
OmniSafeBench-MM is a comprehensive toolbox for multi-modal jailbreak attack-defense evaluation. It integrates 13 representative attack methods, 15 defense strategies, and a diverse dataset spanning 9 major risk domains and 50 fine-grained categories. By unifying data, methodology, and evaluation into an open-source, reproducible platform, OmniSafeBench-MM provides a standardized foundation for future research.
arXiv Detail & Related papers (2025-12-06T22:56:29Z)
- CryptoTensors: A Light-Weight Large Language Model File Format for Highly-Secure Model Distribution [16.430668737524346]
We introduce CryptoTensors, a secure and format-compatible file structure for confidential LLM distribution. Built as an extension to the widely adopted Safetensors format, CryptoTensors incorporates tensor-level encryption and embedded access control policies. Our results highlight CryptoTensors as a light-weight, efficient, and developer-friendly solution for safeguarding LLM weights in real-world and widespread deployments.
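The tensor-level idea, that each named tensor is encrypted separately so access can be gated per tensor, can be illustrated with a toy roundtrip. This is not the actual CryptoTensors layout or cryptography: the XOR keystream below is for demonstration only and is not secure, and all function names are invented.

```python
# Toy illustration of tensor-level encryption: each named tensor gets its own
# derived key, so a policy could grant access to some tensors and not others.
# The SHA-256-based XOR keystream is a demo stand-in, NOT real encryption,
# and the structure is invented, not the actual CryptoTensors format.
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt_tensors(tensors: dict, master_key: bytes) -> dict:
    blob = {}
    for name, data in tensors.items():
        # Derive a per-tensor key from the master key and the tensor's name.
        k = hashlib.sha256(master_key + name.encode()).digest()
        blob[name] = bytes(a ^ b for a, b in zip(data, keystream(k, len(data))))
    return blob

# XOR with the same keystream is its own inverse.
decrypt_tensors = encrypt_tensors

weights = {"layer0.weight": b"\x01\x02\x03\x04"}
enc = encrypt_tensors(weights, b"master")
assert decrypt_tensors(enc, b"master") == weights
```

A real design would use an authenticated cipher (e.g. AES-GCM) and store the access-control policy alongside the encrypted payloads in the file header.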
arXiv Detail & Related papers (2025-12-04T08:49:22Z)
- SGuard-v1: Safety Guardrail for Large Language Models [9.229602223310485]
SGuard-v1 is a lightweight safety guardrail for Large Language Models (LLMs). It comprises two specialized models to detect harmful content and screen adversarial prompts in human-AI conversational settings.
arXiv Detail & Related papers (2025-11-16T08:15:54Z)
- Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models [63.54707418559388]
We propose patching for large language models (LLMs) like software versions. Our method enables rapid remediation by prepending a compact, learnable prefix to an existing model.
arXiv Detail & Related papers (2025-11-11T17:25:44Z)
- Qwen3Guard Technical Report [127.69960525219051]
We present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants. Generative Qwen3Guard casts safety classification as an instruction-following task to enable fine-grained tri-class judgments. Stream Qwen3Guard introduces a token-level classification head for real-time safety monitoring.
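Token-level streaming moderation, the idea behind a token-level classification head, can be sketched as a generator that halts emission the moment a token is flagged. The keyword classifier below is a stub standing in for the actual model head; all names here are illustrative.

```python
# Minimal sketch of token-level streaming moderation, loosely modeled on the
# idea of a per-token safety head (as in Stream Qwen3Guard). The classifier
# is a keyword stub, not the actual model.
def toy_token_classifier(token: str) -> str:
    return "unsafe" if token.lower() in {"bomb", "exploit"} else "safe"

def stream_with_guard(tokens, classifier):
    """Yield tokens until the classifier flags one, then stop generation."""
    for tok in tokens:
        if classifier(tok) == "unsafe":
            yield "[BLOCKED]"
            return
        yield tok

out = list(stream_with_guard(["how", "to", "bomb"], toy_token_classifier))
print(out)  # ['how', 'to', '[BLOCKED]']
```

The benefit over post-hoc moderation is latency: the unsafe continuation is cut off mid-stream rather than after the full response is generated and then scored.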
arXiv Detail & Related papers (2025-10-16T04:00:18Z)
- Bag of Tricks for Subverting Reasoning-based Safety Guardrails [62.139297207938036]
We present a bag of jailbreak methods that subvert reasoning-based guardrails. Our attacks span white-, gray-, and black-box settings and range from effortless template manipulations to fully automated optimization.
arXiv Detail & Related papers (2025-10-13T16:16:44Z)
- What You Code Is What We Prove: Translating BLE App Logic into Formal Models with LLMs for Vulnerability Detection [20.200451226371097]
This paper introduces a key insight: BLE application security analysis can be reframed as a semantic translation problem. We leverage large language models (LLMs) not to directly detect vulnerabilities, but to serve as translators. We implement this idea in VerifiaBLE, a system that combines static analysis, prompt-guided LLM translation, and symbolic verification.
arXiv Detail & Related papers (2025-09-11T09:27:37Z)
- Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs). SecTOW consists of two modules, a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
- LlamaFirewall: An open source guardrail system for building secure AI agents [0.5603362829699733]
Large language models (LLMs) have evolved from simple chatbots into autonomous agents capable of performing complex tasks. Given the higher stakes and the absence of deterministic solutions to mitigate these risks, there is a critical need for a real-time guardrail monitor. We introduce LlamaFirewall, an open-source, security-focused guardrail framework.
arXiv Detail & Related papers (2025-05-06T14:34:21Z)
- Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms [0.9091225937132784]
We reveal a critical control-plane attack surface, distinct from traditional data-plane vulnerabilities. We introduce the Constrained Decoding Attack, a novel jailbreak class that weaponizes structured output constraints to bypass safety mechanisms. Our findings identify a critical security blind spot in current LLM architectures and urge a paradigm shift in LLM safety to address control-plane vulnerabilities.
arXiv Detail & Related papers (2025-03-31T15:08:06Z)
- Zero-Trust Artificial Intelligence Model Security Based on Moving Target Defense and Content Disarm and Reconstruction [4.0208298639821525]
This paper examines the challenges in distributing AI models through model zoos and file transfer mechanisms. The physical security of model files is critical, requiring stringent access controls and attack prevention solutions. It demonstrates a 100% disarm rate when validated against known AI model repositories and actual malware attacks from the HuggingFace model zoo.
arXiv Detail & Related papers (2025-03-03T17:32:19Z)
- TrustRAG: Enhancing Robustness and Trustworthiness in Retrieval-Augmented Generation [31.231916859341865]
TrustRAG is a framework that systematically filters malicious and irrelevant content before it is retrieved for generation. TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance.
arXiv Detail & Related papers (2025-01-01T15:57:34Z)
- AlignGuard: Scalable Safety Alignment for Text-to-Image Generation [68.07258248467309]
Text-to-image (T2I) models are widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse. In this work, we introduce AlignGuard, a method for safety alignment of T2I models.
arXiv Detail & Related papers (2024-12-13T18:59:52Z)
- CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment [66.72332011814183]
CoreGuard is a computation- and communication-efficient protection method for proprietary large language models (LLMs) deployed on edge devices. CoreGuard employs an efficient protection protocol to reduce computational overhead and minimize communication overhead via a propagation protocol.
arXiv Detail & Related papers (2024-10-16T08:14:24Z)
- TransLinkGuard: Safeguarding Transformer Models Against Model Stealing in Edge Deployment [34.8682729537795]
We propose TransLinkGuard, a plug-and-play model protection approach against model stealing on edge devices.
The core part of TransLinkGuard is a lightweight authorization module residing in a secure environment.
Extensive experiments show that TransLinkGuard matches black-box security guarantees with negligible overhead.
arXiv Detail & Related papers (2024-04-17T07:08:45Z)
- HasTEE+ : Confidential Cloud Computing and Analytics with Haskell [50.994023665559496]
Confidential computing enables the protection of confidential code and data in a co-tenanted cloud deployment using specialized hardware isolation units called Trusted Execution Environments (TEEs).
TEEs offer low-level C/C++-based toolchains that are susceptible to inherent memory safety vulnerabilities and lack language constructs to monitor explicit and implicit information-flow leaks.
We address the above with HasTEE+, a domain-specific language (DSL) embedded in Haskell that enables programming TEEs in a high-level language with strong type safety.
arXiv Detail & Related papers (2024-01-17T00:56:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.