Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences
- URL: http://arxiv.org/abs/2502.08142v1
- Date: Wed, 12 Feb 2025 05:48:57 GMT
- Title: Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences
- Authors: Shanshan Han, Salman Avestimehr, Chaoyang He
- Abstract summary: We present Wildflare GuardRail, a guardrail pipeline designed to enhance the safety and reliability of Large Language Model (LLM) inferences. Wildflare GuardRail integrates several core functional modules, including Safety Detector that identifies unsafe inputs and detects hallucinations in model outputs. The lightweight wrappers can address malicious URLs in model outputs in 1.06s per query with 100% accuracy without costly model calls.
- Score: 18.36319991890607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Wildflare GuardRail, a guardrail pipeline designed to enhance the safety and reliability of Large Language Model (LLM) inferences by systematically addressing risks across the entire processing workflow. Wildflare GuardRail integrates several core functional modules, including a Safety Detector that identifies unsafe inputs and detects hallucinations in model outputs while generating root-cause explanations; a Grounding module that contextualizes user queries with information retrieved from vector databases; a Customizer that adjusts outputs in real time using lightweight, rule-based wrappers; and a Repairer that corrects erroneous LLM outputs using the hallucination explanations provided by the Safety Detector. Results show that our unsafe-content detection model in the Safety Detector achieves performance comparable to the OpenAI API, despite being trained on a small dataset constructed from several public datasets. Meanwhile, the lightweight wrappers can address malicious URLs in model outputs in 1.06s per query with 100% accuracy, without costly model calls. Moreover, the hallucination-fixing model reduces hallucinations effectively, achieving an accuracy of 80.7%.
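To make the division of labor among the modules concrete, the sketch below wires up a minimal guardrail pipeline in the spirit of the abstract. All class names, heuristics, and the URL blocklist are illustrative assumptions, not Wildflare GuardRail's actual implementation or API; the detector and repairer are stubs standing in for the trained models described in the paper.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-ins for the pipeline's modules. Names, interfaces, and
# heuristics below are assumptions for illustration only.


@dataclass
class SafetyVerdict:
    unsafe: bool
    hallucination: bool = False
    explanation: str = ""  # root-cause explanation, consumed by the repairer


class SafetyDetector:
    """Screens inputs for unsafe content and outputs for hallucinations (stubbed)."""

    UNSAFE_PATTERNS = ("build a bomb", "steal credentials")

    def check_input(self, query: str) -> SafetyVerdict:
        unsafe = any(p in query.lower() for p in self.UNSAFE_PATTERNS)
        return SafetyVerdict(unsafe=unsafe)

    def check_output(self, answer: str) -> SafetyVerdict:
        # A real detector would be a trained classifier; this stub flags
        # unsupported absolute claims as a placeholder for hallucinations.
        if "guaranteed" in answer.lower():
            return SafetyVerdict(unsafe=False, hallucination=True,
                                 explanation="Unsupported absolute claim.")
        return SafetyVerdict(unsafe=False)


def ground(query: str, retrieve) -> str:
    """Grounding: prepend context retrieved from a vector database to the query."""
    context = retrieve(query)
    return f"Context:\n{context}\n\nQuestion: {query}"


MALICIOUS_URLS = {"http://evil.example.com"}  # assumed blocklist


def customize(answer: str) -> str:
    """Customizer: lightweight rule-based wrapper that redacts malicious URLs."""
    for url in re.findall(r"https?://\S+", answer):
        if url.rstrip(".,)") in MALICIOUS_URLS:
            answer = answer.replace(url, "[link removed]")
    return answer


def repair(answer: str, explanation: str, llm) -> str:
    """Repairer: ask the model to rewrite its answer given the root-cause explanation."""
    return llm(f"Rewrite the answer to fix this issue: {explanation}\n\n{answer}")


def guarded_inference(query: str, llm, retrieve) -> str:
    """End-to-end pass: screen input, ground, generate, fix, then post-process."""
    detector = SafetyDetector()
    if detector.check_input(query).unsafe:
        return "Request declined by safety policy."
    answer = llm(ground(query, retrieve))
    verdict = detector.check_output(answer)
    if verdict.hallucination:
        answer = repair(answer, verdict.explanation, llm)
    return customize(answer)
```

Calling `guarded_inference(query, llm=model_fn, retrieve=vector_db_search)` with any two callables exercises the full path; the actual pipeline would replace each stub with a trained model or a production rule set.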
Related papers
- The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs.
We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z)
- Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis [0.0]
Large Language Models (LLMs) can be backdoored to exhibit malicious behavior under specific deployment conditions.
Recent work by Hubinger et al. demonstrated that backdoors persist through safety training, yet no practical detection methods exist.
We present a novel dual-method detection system combining semantic drift analysis with canary baseline comparison.
arXiv Detail & Related papers (2025-11-20T02:42:41Z)
- RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts [39.58550043591753]
External LLM-based guardrail models have emerged as a popular solution to screen unsafe inputs and outputs.
We investigated how robust LLM-based guardrails are against additional information embedded in the context.
arXiv Detail & Related papers (2025-10-06T19:20:43Z)
- Automating construction safety inspections using a multi-modal vision-language RAG framework [1.737994603273206]
This study introduces SiteShield, a framework for automating construction safety inspection reports by integrating visual and audio inputs.
Using real-world data, SiteShield outperformed unimodal LLMs with an F1 score of 0.82, Hamming loss of 0.04, precision of 0.76, and recall of 0.96.
arXiv Detail & Related papers (2025-10-05T10:48:54Z)
- DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics.
We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
- Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment [24.364891513019444]
In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface.
We propose LARF, a Layer-Aware Representation Filtering method.
Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features.
arXiv Detail & Related papers (2025-07-24T17:59:24Z)
- From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring [17.110388909771626]
Existing moderators mainly practice a conventional full detection, which determines the harmfulness based on the complete LLM output.
Recent works pay more attention to partial detection where moderators oversee the generation midway and early stop the output if harmfulness is detected.
We propose the streaming content monitor, which is trained with dual supervision of response- and token-level labels.
arXiv Detail & Related papers (2025-06-11T17:59:58Z)
- Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data.
Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix.
We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
- Defending against Indirect Prompt Injection by Instruction Detection [81.98614607987793]
We propose a novel approach that takes external data as input and leverages the behavioral state of LLMs during both forward and backward propagation to detect potential IPI attacks.
Our approach achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, while reducing the attack success rate to just 0.12% on the BIPIA benchmark.
arXiv Detail & Related papers (2025-05-08T13:04:45Z)
- LookAhead Tuning: Safer Language Models via Partial Answer Previews [38.7113305301502]
LookAhead Tuning mitigates the degradation of model safety during fine-tuning.
Two simple, low-resource, and effective data-driven methods modify training data by previewing partial answer prefixes.
arXiv Detail & Related papers (2025-03-24T18:11:42Z)
- Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs? [0.836362570897926]
We investigate existing methods for such generalization and find them insufficient.
To avoid performance degradation and preserve safe performance, we advocate for a two-step framework.
We find that the final hidden state for the last token is enough to provide robust performance.
arXiv Detail & Related papers (2025-02-22T10:31:50Z)
- Internal Activation as the Polar Star for Steering Unsafe LLM Behavior [50.463399903987245]
We introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states.
Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility.
arXiv Detail & Related papers (2025-02-03T04:23:33Z)
- TrustRAG: Enhancing Robustness and Trustworthiness in RAG [31.231916859341865]
TrustRAG is a framework that systematically filters compromised and irrelevant contents before they are retrieved for generation.
TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance compared to existing approaches.
arXiv Detail & Related papers (2025-01-01T15:57:34Z)
- Enhancing AI Safety Through the Fusion of Low Rank Adapters [7.384556630042846]
Low-Rank Adapter Fusion mitigates harmful responses when faced with malicious prompts.
We show a 42% reduction in the harmfulness rate by leveraging LoRA fusion between a task adapter and a safety adapter.
We also observe exaggerated safety behaviour, where the model rejects safe prompts that closely resemble unsafe ones.
arXiv Detail & Related papers (2024-12-30T13:12:27Z)
- Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models [30.93821289892195]
We propose IRR (Identify, Remove, and Recalibrate for Safety Realignment) that performs safety realignment for LLMs.
The core of IRR is to identify and remove unsafe delta parameters from the fine-tuned models, while recalibrating the retained ones.
Our results demonstrate that IRR significantly enhances the safety performance of fine-tuned models on safety benchmarks, such as harmful queries and jailbreak attacks.
arXiv Detail & Related papers (2024-12-15T03:58:38Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
- Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models [51.20476412037321]
We propose Safe LoRA, a simple one-liner patch to the original LoRA implementation by introducing the projection of LoRA weights from selected layers to the safety-aligned subspace.
Our experiments demonstrate that when fine-tuning on purely malicious data, Safe LoRA retains similar safety performance as the original aligned model.
arXiv Detail & Related papers (2024-05-27T05:04:05Z)
- Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike the existing methods of designing a backdoor for the input/output space of diffusion models, in our method, we propose to embed the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z)
- Semi-supervised Open-World Object Detection [74.95267079505145]
We introduce a more realistic formulation, named semi-supervised open-world detection (SS-OWOD).
We demonstrate that the performance of the state-of-the-art OWOD detector dramatically deteriorates in the proposed SS-OWOD setting.
Our experiments on 4 datasets including MS COCO, PASCAL, Objects365 and DOTA demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-02-25T07:12:51Z)
- Learning Robust Output Control Barrier Functions from Safe Expert Demonstrations [50.37808220291108]
This paper addresses learning safe output feedback control laws from partial observations of expert demonstrations.
We first propose robust output control barrier functions (ROCBFs) as a means to guarantee safety.
We then formulate an optimization problem to learn ROCBFs from expert demonstrations that exhibit safe system behavior.
arXiv Detail & Related papers (2021-11-18T23:21:00Z)
- Uncertainty for Identifying Open-Set Errors in Visual Object Detection [31.533136658421892]
GMM-Det is a real-time method for extracting uncertainty from object detectors to identify and reject open-set errors.
We show that GMM-Det consistently outperforms existing uncertainty techniques for identifying and rejecting open-set detections.
arXiv Detail & Related papers (2021-04-03T07:12:31Z)