Related papers: No Free Lunch with Guardrails

No Free Lunch with Guardrails

URL: http://arxiv.org/abs/2504.00441v2
Date: Thu, 03 Apr 2025 13:34:57 GMT
Title: No Free Lunch with Guardrails
Authors: Divyanshu Kumar, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi,
Abstract summary: We evaluate whether current guardrails effectively prevent misuse while maintaining practical utility.<n>Our findings confirm that there is no free lunch with guardrails; strengthening security often comes at the cost of usability.<n>We propose a blueprint for designing better guardrails that minimize risk while maintaining usability.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: As large language models (LLMs) and generative AI become widely adopted, guardrails have emerged as a key tool to ensure their safe use. However, adding guardrails isn't without tradeoffs; stronger security measures can reduce usability, while more flexible systems may leave gaps for adversarial attacks. In this work, we explore whether current guardrails effectively prevent misuse while maintaining practical utility. We introduce a framework to evaluate these tradeoffs, measuring how different guardrails balance risk, security, and usability, and build an efficient guardrail. Our findings confirm that there is no free lunch with guardrails; strengthening security often comes at the cost of usability. To address this, we propose a blueprint for designing better guardrails that minimize risk while maintaining usability. We evaluate various industry guardrails, including Azure Content Safety, Bedrock Guardrails, OpenAI's Moderation API, Guardrails AI, Nemo Guardrails, and Enkrypt AI guardrails. Additionally, we assess how LLMs like GPT-4o, Gemini 2.0-Flash, Claude 3.5-Sonnet, and Mistral Large-Latest respond under different system prompts, including simple prompts, detailed prompts, and detailed prompts with chain-of-thought (CoT) reasoning. Our study provides a clear comparison of how different guardrails perform, highlighting the challenges in balancing security and usability.

Related papers

Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
Defending Against Prompt Injection With a Few DefensiveTokens [53.7493897456957]
Large language model (LLM) systems interact with external data to perform complex tasks.<n>By injecting instructions into the data accessed by the system, an attacker can override the initial user task with an arbitrary task directed by the attacker.<n>Test-time defenses, e.g., defensive prompting, have been proposed for system developers to attain security only when needed in a flexible manner.<n>We propose DefensiveToken, a test-time defense with prompt injection comparable to training-time alternatives.
arXiv Detail & Related papers (2025-07-10T17:51:05Z)
Reasoning as an Adaptive Defense for Safety [31.00328416755368]
We build a recipe called $textitTARS$ (Training Adaptive Reasoners for Safety)<n>We train models to reason about safety using chain-of-thought traces and a reward signal that balances safety with task completion.<n>Our work provides an effective, open recipe for training LLMs against jailbreaks and harmful requests by reasoning per prompt.
arXiv Detail & Related papers (2025-07-01T17:20:04Z)
SoK: Evaluating Jailbreak Guardrails for Large Language Models [29.82176024701988]
Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities.<n>Guardrails--external defense mechanisms that monitor and control LLM interaction--have emerged as a promising solution.<n>We present the first holistic analysis of jailbreak guardrails for LLMs.
arXiv Detail & Related papers (2025-06-12T11:42:40Z)
Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement [48.50995874445193]
Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks.<n>We propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs' strong safety discrimination performance with their relatively weaker safety generation ability.
arXiv Detail & Related papers (2025-05-17T15:54:52Z)
Safety Guardrails for LLM-Enabled Robots [82.0459036717193]
Traditional robot safety approaches do not address the novel vulnerabilities of large language models (LLMs)<n>We propose RoboGuard, a two-stage guardrail architecture to ensure the safety of LLM-enabled robots.<n>We show that RoboGuard reduces the execution of unsafe plans from 92% to below 2.5% without compromising performance on safe plans.
arXiv Detail & Related papers (2025-03-10T22:01:56Z)
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks [55.29301192316118]
Large language models (LLMs) are highly vulnerable to jailbreaking attacks.<n>We propose a safety steering framework grounded in safe control theory.<n>Our method achieves invariant safety at each turn of dialogue by learning a safety predictor.
arXiv Detail & Related papers (2025-02-28T21:10:03Z)
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs [44.023741610675266]
Large language models (LLMs) can be manipulated into unsafe behaviour by prompts known as jailbreaks.<n>Not all defences are able to handle new out-of-distribution attacks due to the narrow segment of jailbreaks used to align them.<n>We show that based on current datasets available for evaluation, simple baselines can display competitive out-of-distribution performance.
arXiv Detail & Related papers (2025-02-21T12:54:25Z)
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.<n>We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.<n>Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
Benchmarking LLM Guardrails in Handling Multilingual Toxicity [57.296161186129545]
We introduce a comprehensive multilingual test suite, spanning seven datasets and over ten languages, to benchmark the performance of state-of-the-art guardrails. We investigate the resilience of guardrails against recent jailbreaking techniques, and assess the impact of in-context safety policies and language resource availability on guardrails' performance. Our findings show that existing guardrails are still ineffective at handling multilingual toxicity and lack robustness against jailbreaking prompts.
arXiv Detail & Related papers (2024-10-29T15:51:24Z)
Evaluating Defences against Unsafe Feedback in RLHF [26.872318173182414]
This paper looks at learning from unsafe feedback with reinforcement learning.<n>We find that safety-aligned LLMs easily explore unsafe action spaces via generating harmful text.<n>In order to protect against this vulnerability, we adapt a number of both "implict" and "explicit" harmful fine-tuning defences.
arXiv Detail & Related papers (2024-09-19T17:10:34Z)
PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing [1.474945380093949]
Inference-Time Guardrails (ITG) offer solutions that shift model output distributions towards compliance. Current methods struggle in balancing safety with helpfulness. We propose PrimeGuard, a novel ITG method that utilizes structured control flow.
arXiv Detail & Related papers (2024-07-23T09:14:27Z)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning [8.408258504178718]
Existing guardrail models treat various safety categories independently and fail to explicitly capture the intercorrelations among them. We propose $R2$-Guard, a robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning. $R2$-Guard significantly surpasses SOTA method LlamaGuard by 30.2% on ToxicChat and by 59.5% against jailbreaking attacks.
arXiv Detail & Related papers (2024-07-08T02:15:29Z)
GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning [79.07152553060601]
We propose GuardAgent, the first guardrail agent to protect the target agents by dynamically checking whether their actions satisfy given safety guard requests.<n>Specifically, GuardAgent first analyzes the safety guard requests to generate a task plan, and then maps this plan into guardrail code for execution.<n>We show that GuardAgent effectively moderates the violation actions for different types of agents on two benchmarks with over 98% and 83% guardrail accuracies.
arXiv Detail & Related papers (2024-06-13T14:49:26Z)
Fight Back Against Jailbreaking via Prompt Adversarial Tuning [23.55544992740663]
Large Language Models (LLMs) are susceptible to jailbreaking attacks. We propose an approach named Prompt Adversarial Tuning (PAT) that trains a prompt control attached to the user prompt as a guard prefix. Our method is effective against both grey-box and black-box attacks, reducing the success rate of advanced attacks to nearly 0%.
arXiv Detail & Related papers (2024-02-09T09:09:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.