Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems
- URL: http://arxiv.org/abs/2510.13351v1
- Date: Wed, 15 Oct 2025 09:40:24 GMT
- Title: Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems
- Authors: Karthik Avinash, Nikhil Pareek, Rishav Hada,
- Abstract summary: Protect is a multi-modal guardrailing model designed to operate seamlessly across text, image, and audio inputs. It integrates category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive, multi-modal dataset. Our teacher-assisted annotation pipeline leverages reasoning and explanation traces to generate high-fidelity, context-aware labels.
- Score: 4.404101728634984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing deployment of Large Language Models (LLMs) across enterprise and mission-critical domains has underscored the urgent need for robust guardrailing systems that ensure safety, reliability, and compliance. Existing solutions often struggle with real-time oversight, multi-modal data handling, and explainability, limitations that hinder their adoption in regulated environments. Existing guardrails largely operate in isolation and focus on text alone, making them inadequate for multi-modal, production-scale environments. We introduce Protect, a natively multi-modal guardrailing model built for enterprise-grade deployment that operates seamlessly across text, image, and audio inputs. Protect integrates fine-tuned, category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive, multi-modal dataset covering four safety dimensions: toxicity, sexism, data privacy, and prompt injection. Our teacher-assisted annotation pipeline leverages reasoning and explanation traces to generate high-fidelity, context-aware labels across modalities. Experimental results demonstrate state-of-the-art performance across all safety dimensions, surpassing existing open and proprietary models such as WildGuard, LlamaGuard-4, and GPT-4.1. Protect establishes a strong foundation for trustworthy, auditable, and production-ready safety systems capable of operating across text, image, and audio modalities.
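The abstract's idea of category-specific LoRA adapters can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the base model, adapter ranks, or how adapters are selected, so the class names (`LoRAAdapter`, `AdaptedLinear`), the toy dimensions, and the `alpha` and `rank` values below are all hypothetical. The sketch only shows the general LoRA mechanism, in which a frozen weight matrix W is augmented by a scaled low-rank product (alpha / r) * B * A, with one (A, B) pair kept per safety category.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRAAdapter:
    """Hypothetical low-rank update: delta(x) = (alpha / rank) * B @ (A @ x)."""

    def __init__(self, d_in, d_out, rank, alpha, seed=0):
        rng = random.Random(seed)
        # Common LoRA initialisation: A is small random, B is zero,
        # so an untrained adapter leaves the base layer unchanged.
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def delta(self, x):
        return [self.scale * v for v in matvec(self.B, matvec(self.A, x))]


class AdaptedLinear:
    """Frozen base weights W plus one named adapter per category."""

    def __init__(self, W):
        self.W = W  # never modified after construction
        self.adapters = {}

    def add_adapter(self, category, adapter):
        self.adapters[category] = adapter

    def forward(self, x, category=None):
        base = matvec(self.W, x)
        if category is None:
            return base
        update = self.adapters[category].delta(x)
        return [b + u for b, u in zip(base, update)]


# Toy usage: the category names come from the abstract; the numbers are made up.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
layer = AdaptedLinear(W)
for i, name in enumerate(["toxicity", "sexism", "data_privacy", "prompt_injection"]):
    layer.add_adapter(name, LoRAAdapter(d_in=3, d_out=2, rank=1, alpha=2.0, seed=i))

x = [1.0, 2.0, 3.0]
base_out = layer.forward(x)                  # [7.0, -1.0]
untrained_out = layer.forward(x, "toxicity")  # equals base_out because B is zero

# Stand-in for training: set one adapter's factors by hand.
layer.adapters["toxicity"].A = [[1.0, 0.0, 0.0]]
layer.adapters["toxicity"].B = [[1.0], [0.0]]
tuned_out = layer.forward(x, "toxicity")     # [9.0, -1.0]
other_out = layer.forward(x, "sexism")       # still equals base_out
```

Because each adapter only adds a small (A, B) pair on top of shared frozen weights, changing one category's adapter leaves the outputs for the other categories untouched, which is one plausible reason to prefer per-category adapters over separate full models.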
Related papers
- BarrierSteer: LLM Safety via Learning Barrier Steering [83.12893815611052]
  BarrierSteer is a novel framework that formalizes safety by embedding learned non-linear safety constraints directly into the model's latent representation space. We show that BarrierSteer substantially reduces adversarial success rates, decreases unsafe generations, and outperforms existing methods.
  (2026-02-23T18:19:46Z)
- AprielGuard [2.3704817495377526]
  Existing tools treat safety risks as separate problems, limiting robustness and generalizability. We introduce AprielGuard, an 8B parameter safeguard model that unifies these dimensions within a single taxonomy and learning framework. AprielGuard achieves strong performance in detecting harmful content and adversarial manipulations.
  (2025-12-23T12:01:32Z)
- OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning [25.190494543355047]
  We propose OmniGuard, a family of omni-modal guardrails that performs safeguarding across all modalities with deliberate reasoning ability. To support the training of OmniGuard, we curate a large, comprehensive omni-modal safety dataset comprising over 210K diverse samples. Experiments on 15 benchmarks show that OmniGuard achieves strong effectiveness and generalization across a wide range of multimodal safety scenarios.
  (2025-12-02T01:01:44Z)
- Qwen3Guard Technical Report [127.69960525219051]
  We present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants. Generative Qwen3Guard casts safety classification as an instruction-following task to enable fine-grained tri-class judgments. Stream Qwen3Guard introduces a token-level classification head for real-time safety monitoring.
  (2025-10-16T04:00:18Z)
- Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
  We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs). SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). We show that SecTOW significantly improves security while preserving general performance.
  (2025-07-29T17:39:48Z)
- Automating Steering for Safe Multimodal Large Language Models [58.36932318051907]
  We introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected.
  (2025-07-17T16:04:55Z)
- Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding [59.50808215134678]
  This study introduces Trust-videoLLMs, a first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs. Results reveal significant limitations in dynamic scene comprehension, cross-modal resilience and real-world risk mitigation.
  (2025-06-14T04:04:54Z)
- Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
  Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. We propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences.
  (2025-04-14T09:03:51Z)
- Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models [25.606641582511106]
  We propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Our experiments demonstrate that fine-tuning InternVL2.5-8B with MIS significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks.
  (2025-01-30T17:59:45Z)
- RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting [7.0595410083835315]
  RapGuard is a novel framework that uses multimodal chain-of-thought reasoning to generate scenario-specific safety prompts. RapGuard achieves state-of-the-art safety performance, significantly reducing harmful content without degrading the quality of responses.
  (2024-12-25T08:31:53Z)
- A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection [0.0]
  Large Language Models (LLMs) are prone to off-topic misuse, where users may prompt these models to perform tasks beyond their intended scope. Current guardrails suffer from high false-positive rates, limited adaptability, and the impracticality of requiring real-world data that is not available in pre-production. We introduce a flexible, data-free guardrail development methodology that addresses these challenges.
  (2024-11-20T00:31:23Z)
- A Middle Path for On-Premises LLM Deployment: Preserving Privacy Without Sacrificing Model Confidentiality [20.646221081945523]
  Privacy-sensitive users require deploying large language models (LLMs) within their own infrastructure (on-premises) to safeguard private data and enable customization. Previous research on small models has explored securing only the output layer within hardware-secured devices to balance model confidentiality and customization. We propose SOLID, a novel deployment framework that secures a few bottom layers in a secure environment and introduces an efficient metric to optimize the trade-off.
  (2024-10-15T02:00:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.