Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences
- URL: http://arxiv.org/abs/2502.08142v1
- Date: Wed, 12 Feb 2025 05:48:57 GMT
- Title: Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences
- Authors: Shanshan Han, Salman Avestimehr, Chaoyang He
- Abstract summary: We present Wildflare GuardRail, a guardrail pipeline designed to enhance the safety and reliability of Large Language Model (LLM) inferences. Wildflare GuardRail integrates several core functional modules, including Safety Detector that identifies unsafe inputs and detects hallucinations in model outputs. The lightweight wrappers can address malicious URLs in model outputs in 1.06s per query with 100% accuracy without costly model calls.
- Score: 18.36319991890607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Wildflare GuardRail, a guardrail pipeline designed to enhance the safety and reliability of Large Language Model (LLM) inferences by systematically addressing risks across the entire processing workflow. Wildflare GuardRail integrates several core functional modules, including a Safety Detector that identifies unsafe inputs and detects hallucinations in model outputs while generating root-cause explanations; a Grounding module that contextualizes user queries with information retrieved from vector databases; a Customizer that adjusts outputs in real time using lightweight, rule-based wrappers; and a Repairer that corrects erroneous LLM outputs using the hallucination explanations provided by the Safety Detector. Results show that our unsafe-content detection model in the Safety Detector achieves performance comparable to the OpenAI API, despite being trained on a small dataset constructed from several public datasets. Meanwhile, the lightweight wrappers can address malicious URLs in model outputs in 1.06s per query with 100% accuracy, without costly model calls. Moreover, the hallucination-fixing model reduces hallucinations effectively, achieving an accuracy of 80.7%.
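To make the division of labor among the modules concrete, the sketch below wires up a minimal guardrail pipeline in the spirit of the abstract. All class names, heuristics, and the URL blocklist are illustrative assumptions, not Wildflare GuardRail's actual implementation or API; the detector and repairer are stubs standing in for the trained models described in the paper.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-ins for the pipeline's modules. Names, interfaces, and
# heuristics below are assumptions for illustration only.


@dataclass
class SafetyVerdict:
    unsafe: bool
    hallucination: bool = False
    explanation: str = ""  # root-cause explanation, consumed by the repairer


class SafetyDetector:
    """Screens inputs for unsafe content and outputs for hallucinations (stubbed)."""

    UNSAFE_PATTERNS = ("build a bomb", "steal credentials")

    def check_input(self, query: str) -> SafetyVerdict:
        unsafe = any(p in query.lower() for p in self.UNSAFE_PATTERNS)
        return SafetyVerdict(unsafe=unsafe)

    def check_output(self, answer: str) -> SafetyVerdict:
        # A real detector would be a trained classifier; this stub flags
        # unsupported absolute claims as a placeholder for hallucinations.
        if "guaranteed" in answer.lower():
            return SafetyVerdict(unsafe=False, hallucination=True,
                                 explanation="Unsupported absolute claim.")
        return SafetyVerdict(unsafe=False)


def ground(query: str, retrieve) -> str:
    """Grounding: prepend context retrieved from a vector database to the query."""
    context = retrieve(query)
    return f"Context:\n{context}\n\nQuestion: {query}"


MALICIOUS_URLS = {"http://evil.example.com"}  # assumed blocklist


def customize(answer: str) -> str:
    """Customizer: lightweight rule-based wrapper that redacts malicious URLs."""
    for url in re.findall(r"https?://\S+", answer):
        if url.rstrip(".,)") in MALICIOUS_URLS:
            answer = answer.replace(url, "[link removed]")
    return answer


def repair(answer: str, explanation: str, llm) -> str:
    """Repairer: ask the model to rewrite its answer given the root-cause explanation."""
    return llm(f"Rewrite the answer to fix this issue: {explanation}\n\n{answer}")


def guarded_inference(query: str, llm, retrieve) -> str:
    """End-to-end pass: screen input, ground, generate, fix, then post-process."""
    detector = SafetyDetector()
    if detector.check_input(query).unsafe:
        return "Request declined by safety policy."
    answer = llm(ground(query, retrieve))
    verdict = detector.check_output(answer)
    if verdict.hallucination:
        answer = repair(answer, verdict.explanation, llm)
    return customize(answer)
```

Calling `guarded_inference(query, llm=model_fn, retrieve=vector_db_search)` with any two callables exercises the full path; the actual pipeline would replace each stub with a trained model or a production rule set.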
Related papers
- The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs.
We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z)
- Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis [0.0]
Large Language Models (LLMs) can be backdoored to exhibit malicious behavior under specific deployment conditions.
Recent work by Hubinger et al. demonstrated that backdoors persist through safety training, yet no practical detection methods exist.
We present a novel dual-method detection system combining semantic drift analysis with canary baseline comparison.
arXiv Detail & Related papers (2025-11-20T02:42:41Z)
- RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts [39.58550043591753]
External LLM-based guardrail models have emerged as a popular solution to screen unsafe inputs and outputs.
We investigated how robust LLM-based guardrails are against additional information embedded in the context.
arXiv Detail & Related papers (2025-10-06T19:20:43Z)
- Automating construction safety inspections using a multi-modal vision-language RAG framework [1.737994603273206]
This study introduces SiteShield, a framework for automating construction safety inspection reports by integrating visual and audio inputs.
Using real-world data, SiteShield outperformed unimodal LLMs with an F1 score of 0.82, Hamming loss of 0.04, precision of 0.76, and recall of 0.96.
arXiv Detail & Related papers (2025-10-05T10:48:54Z)
- DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics.
We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
- Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment [24.364891513019444]
In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface.
We propose LARF, a Layer-Aware Representation Filtering method.
Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features.
arXiv Detail & Related papers (2025-07-24T17:59:24Z)
- From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring [17.110388909771626]
Existing moderators mainly practice a conventional full detection, which determines the harmfulness based on the complete LLM output.
Recent works pay more attention to partial detection where moderators oversee the generation midway and early stop the output if harmfulness is detected.
We propose the streaming content monitor, which is trained with dual supervision of response- and token-level labels.
arXiv Detail & Related papers (2025-06-11T17:59:58Z)
- Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data.
Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix.
We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
- Defending against Indirect Prompt Injection by Instruction Detection [81.98614607987793]
We propose a novel approach that takes external data as input and leverages the behavioral state of LLMs during both forward and backward propagation to detect potential IPI attacks.
Our approach achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, while reducing the attack success rate to just 0.12% on the BIPIA benchmark.
arXiv Detail & Related papers (2025-05-08T13:04:45Z)
- LookAhead Tuning: Safer Language Models via Partial Answer Previews [38.7113305301502]
LookAhead Tuning mitigates the degradation of model safety during fine-tuning.
Two simple, low-resource, and effective data-driven methods modify training data by previewing partial answer prefixes.
arXiv Detail & Related papers (2025-03-24T18:11:42Z)
- Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs? [0.836362570897926]
We investigate existing methods for such generalization and find them insufficient.
To avoid performance degradation and preserve safe performance, we advocate for a two-step framework.
We find that the final hidden state for the last token is enough to provide robust performance.
arXiv Detail & Related papers (2025-02-22T10:31:50Z)
- Internal Activation as the Polar Star for Steering Unsafe LLM Behavior [50.463399903987245]
We introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states.
Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility.
arXiv Detail & Related papers (2025-02-03T04:23:33Z)
- TrustRAG: Enhancing Robustness and Trustworthiness in RAG [31.231916859341865]
TrustRAG is a framework that systematically filters compromised and irrelevant contents before they are retrieved for generation.
TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance compared to existing approaches.
arXiv Detail & Related papers (2025-01-01T15:57:34Z)
- Enhancing AI Safety Through the Fusion of Low Rank Adapters [7.384556630042846]
Low-Rank Adapter Fusion mitigates harmful responses when faced with malicious prompts.
We show a 42% reduction in the harmfulness rate by leveraging LoRA fusion between a task adapter and a safety adapter.
We also observe exaggerated safety behaviour, where the model rejects safe prompts that closely resemble unsafe ones.
arXiv Detail & Related papers (2024-12-30T13:12:27Z)
- Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models [30.93821289892195]
We propose IRR (Identify, Remove, and Recalibrate for Safety Realignment) that performs safety realignment for LLMs.
The core of IRR is to identify and remove unsafe delta parameters from the fine-tuned models, while recalibrating the retained ones.
Our results demonstrate that IRR significantly enhances the safety performance of fine-tuned models on safety benchmarks, such as harmful queries and jailbreak attacks.
arXiv Detail & Related papers (2024-12-15T03:58:38Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
- Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models [51.20476412037321]
We propose Safe LoRA, a simple one-liner patch to the original LoRA implementation by introducing the projection of LoRA weights from selected layers to the safety-aligned subspace.
Our experiments demonstrate that when fine-tuning on purely malicious data, Safe LoRA retains similar safety performance as the original aligned model.
arXiv Detail & Related papers (2024-05-27T05:04:05Z)
- Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike the existing methods of designing a backdoor for the input/output space of diffusion models, in our method, we propose to embed the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z)
- Semi-supervised Open-World Object Detection [74.95267079505145]
We introduce a more realistic formulation, named semi-supervised open-world detection (SS-OWOD).
We demonstrate that the performance of the state-of-the-art OWOD detector dramatically deteriorates in the proposed SS-OWOD setting.
Our experiments on 4 datasets including MS COCO, PASCAL, Objects365 and DOTA demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-02-25T07:12:51Z)
- Learning Robust Output Control Barrier Functions from Safe Expert Demonstrations [50.37808220291108]
This paper addresses learning safe output feedback control laws from partial observations of expert demonstrations.
We first propose robust output control barrier functions (ROCBFs) as a means to guarantee safety.
We then formulate an optimization problem to learn ROCBFs from expert demonstrations that exhibit safe system behavior.
arXiv Detail & Related papers (2021-11-18T23:21:00Z)
- Uncertainty for Identifying Open-Set Errors in Visual Object Detection [31.533136658421892]
GMM-Det is a real-time method for extracting uncertainty from object detectors to identify and reject open-set errors.
We show that GMM-Det consistently outperforms existing uncertainty techniques for identifying and rejecting open-set detections.
arXiv Detail & Related papers (2021-04-03T07:12:31Z)