PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
- URL: http://arxiv.org/abs/2407.16318v1
- Date: Tue, 23 Jul 2024 09:14:27 GMT
- Title: PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
- Authors: Blazej Manczak, Eliott Zemour, Eric Lin, Vaikkunth Mugunthan,
- Abstract summary: Inference-Time Guardrails (ITG) offer solutions that shift model output distributions towards compliance.
Current methods struggle in balancing safety with helpfulness.
We propose PrimeGuard, a novel ITG method that utilizes structured control flow.
- Score: 1.474945380093949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deploying language models (LMs) necessitates outputs to be both high-quality and compliant with safety guidelines. Although Inference-Time Guardrails (ITG) offer solutions that shift model output distributions towards compliance, we find that current methods struggle in balancing safety with helpfulness. ITG Methods that safely address non-compliant queries exhibit lower helpfulness while those that prioritize helpfulness compromise on safety. We refer to this trade-off as the guardrail tax, analogous to the alignment tax. To address this, we propose PrimeGuard, a novel ITG method that utilizes structured control flow. PrimeGuard routes requests to different self-instantiations of the LM with varying instructions, leveraging its inherent instruction-following capabilities and in-context learning. Our tuning-free approach dynamically compiles system-designer guidelines for each query. We construct and release safe-eval, a diverse red-team safety benchmark. Extensive evaluations demonstrate that PrimeGuard, without fine-tuning, overcomes the guardrail tax by (1) significantly increasing resistance to iterative jailbreak attacks and (2) achieving state-of-the-art results in safety guardrailing while (3) matching helpfulness scores of alignment-tuned models. Extensive evaluations demonstrate that PrimeGuard, without fine-tuning, outperforms all competing baselines and overcomes the guardrail tax by improving the fraction of safe responses from 61% to 97% and increasing average helpfulness scores from 4.17 to 4.29 on the largest models, while reducing attack success rate from 100% to 8%. PrimeGuard implementation is available at https://github.com/dynamofl/PrimeGuard and safe-eval dataset is available at https://huggingface.co/datasets/dynamoai/safe_eval.
Related papers
- CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety [8.24714635902347]
We introduce CourtGuard, a retrieval-augmented multi-agent framework that reimagines safety evaluation as Evidentiary Debate.<n>By orchestrating an adversarial debate grounded in external policy documents, CourtGuard achieves state-of-the-art performance across 7 safety benchmarks.
arXiv Detail & Related papers (2026-02-26T02:52:11Z) - Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection [52.551864761088574]
Large Language Models (LLMs) often incur an alignment tax: safety post-training can reduce general utility.<n>We argue that this tax primarily arises from continual-learning-style forgetting in sequential alignment.<n>We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA) to balance plasticity and stability.
arXiv Detail & Related papers (2026-02-08T09:53:46Z) - Bag of Tricks for Subverting Reasoning-based Safety Guardrails [62.139297207938036]
We present a bag of jailbreak methods that subvert the reasoning-based guardrails.<n>Our attacks span white-, gray-, and black-box settings and range from effortless template manipulations to fully automated optimization.
arXiv Detail & Related papers (2025-10-13T16:16:44Z) - UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models [67.91151588917396]
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks.<n>We propose UpSafe$circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling.<n>Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
arXiv Detail & Related papers (2025-10-02T16:43:33Z) - DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics.<n>We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z) - AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software [11.606665113249298]
Guardrails are critical for the safe deployment of Large Language Models (LLMs)-powered software.<n>We propose AdaptiveGuard, an adaptive guardrail that detects novel jailbreak attacks as out-of-distribution (OOD) inputs.<n>We show that AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in just two update steps, and retains over 85% F1-score on in-distribution data post-adaptation.
arXiv Detail & Related papers (2025-09-21T01:22:42Z) - UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases [57.69882799751655]
We release UnsafeChain, a safety alignment dataset constructed from hard prompts with diverse sources.<n>We fine-tune three large reasoning models (LRMs) and compare them against recent SafeChain and STAR-1.<n>UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching or surpassing baseline performance.
arXiv Detail & Related papers (2025-07-29T10:08:52Z) - ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning [49.47193675702453]
Large Language Models (LLMs) have demonstrated remarkable generative capabilities.<n>LLMs remain vulnerable to malicious instructions that can bypass safety constraints.<n>We propose a reasoning-based safety alignment framework, ARMOR, that replaces the ad-hoc chains of thought reasoning process with human-aligned, structured one.
arXiv Detail & Related papers (2025-07-14T09:05:54Z) - SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning [76.56522719330911]
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering.<n>LRMs pose great safety risks against harmful queries and adversarial attacks.<n>We propose SafeKey to better activate the safety aha moment in the key sentence.
arXiv Detail & Related papers (2025-05-22T03:46:03Z) - T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks [67.91652526657599]
We formalize the T2V jailbreak attack as a discrete optimization problem and propose a joint objective-based optimization framework, called T2V-OptJail.<n>We conduct large-scale experiments on several T2V models, covering both open-source models and real commercial closed-source models.<n>The proposed method improves 11.4% and 10.0% over the existing state-of-the-art method in terms of attack success rate.
arXiv Detail & Related papers (2025-05-10T16:04:52Z) - AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage.
We show that scaling agentic reasoning system at test-time substantially enhances robustness without compromising model utility.
Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z) - AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender [73.09848497762667]
We propose AdaSteer, an adaptive activation steering method that adjusts model behavior based on input characteristics.
AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD)
Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
arXiv Detail & Related papers (2025-04-13T07:39:17Z) - No Free Lunch with Guardrails [0.0]
We evaluate whether current guardrails effectively prevent misuse while maintaining practical utility.
Our findings confirm that there is no free lunch with guardrails; strengthening security often comes at the cost of usability.
We propose a blueprint for designing better guardrails that minimize risk while maintaining usability.
arXiv Detail & Related papers (2025-04-01T05:46:54Z) - ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails [33.96886111900147]
ThinkGuard is a critique-augmented guardrail model that distills knowledge from high-capacity language models.
It achieves the highest average F1 and AUPRC, outperforming all baselines.
It surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning.
arXiv Detail & Related papers (2025-02-19T06:09:58Z) - STAIR: Improving Safety Alignment with Introspective Reasoning [44.780098674618614]
We propose STAIR, a framework that integrates SafeTy Alignment with Itrospective Reasoning.
We show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies.
With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks.
arXiv Detail & Related papers (2025-02-04T15:02:55Z) - AlignGuard: Scalable Safety Alignment for Text-to-Image Generation [68.07258248467309]
Text-to-image (T2I) models are widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse.<n>In this work, we introduce AlignGuard, a method for safety alignment of T2I models.
arXiv Detail & Related papers (2024-12-13T18:59:52Z) - Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy [53.54777131440989]
Large Language Models (LLMs) are susceptible to security and safety threats.
One major cause of these vulnerabilities is the lack of an instruction hierarchy.
We introduce the instructional segment Embedding (ISE) technique, inspired by BERT, to modern large language models.
arXiv Detail & Related papers (2024-10-09T12:52:41Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs)
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - $R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning [8.408258504178718]
Existing guardrail models treat various safety categories independently and fail to explicitly capture the intercorrelations among them.
We propose $R2$-Guard, a robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning.
$R2$-Guard significantly surpasses SOTA method LlamaGuard by 30.2% on ToxicChat and by 59.5% against jailbreaking attacks.
arXiv Detail & Related papers (2024-07-08T02:15:29Z) - WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [54.10865585773691]
We introduce WildGuard -- an open, light-weight moderation tool for LLM safety.
WildGuard achieves three goals: identifying malicious intent in user prompts, detecting safety risks of model responses, and determining model refusal rate.
arXiv Detail & Related papers (2024-06-26T16:58:20Z) - SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance [48.80398992974831]
SafeAligner is a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks.
We develop two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses.
We show that SafeAligner can increase the likelihood of beneficial tokens, while reducing the occurrence of harmful ones.
arXiv Detail & Related papers (2024-06-26T07:15:44Z) - Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment [56.2017039028998]
Fine-tuning of Language-Model-as-a-Service (LM) introduces new threats, particularly against the Fine-tuning based Jailbreak Attack (FJAttack)
We propose the Backdoor Enhanced Safety Alignment method inspired by an analogy with the concept of backdoor attacks.
Our comprehensive experiments demonstrate that through the Backdoor Enhanced Safety Alignment with adding as few as 11 safety examples, the maliciously finetuned LLMs will achieve similar safety performance as the original aligned models without harming the benign performance.
arXiv Detail & Related papers (2024-02-22T21:05:18Z) - Fight Back Against Jailbreaking via Prompt Adversarial Tuning [23.55544992740663]
Large Language Models (LLMs) are susceptible to jailbreaking attacks.
We propose an approach named Prompt Adversarial Tuning (PAT) that trains a prompt control attached to the user prompt as a guard prefix.
Our method is effective against both grey-box and black-box attacks, reducing the success rate of advanced attacks to nearly 0%.
arXiv Detail & Related papers (2024-02-09T09:09:39Z) - Fine-tuning Aligned Language Models Compromises Safety, Even When Users
Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning.
We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples.
We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.