Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks
- URL: http://arxiv.org/abs/2511.22047v1
- Date: Thu, 27 Nov 2025 03:01:09 GMT
- Title: Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks
- Authors: Richard J. Young
- Abstract summary: Large Language Model (LLM) safety guardrail models have emerged as a primary defense mechanism against harmful content generation. This study evaluated ten publicly available guardrail models from Meta, Google, IBM, NVIDIA, Alibaba, and Allen AI across 1,445 test prompts spanning 21 attack categories.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Model (LLM) safety guardrail models have emerged as a primary defense mechanism against harmful content generation, yet their robustness against sophisticated adversarial attacks remains poorly characterized. This study evaluated ten publicly available guardrail models from Meta, Google, IBM, NVIDIA, Alibaba, and Allen AI across 1,445 test prompts spanning 21 attack categories. While Qwen3Guard-8B achieved the highest overall accuracy (85.3%, 95% CI: 83.4-87.1%), a critical finding emerged when separating public benchmark prompts from novel attacks: all models showed substantial performance degradation on unseen prompts, with Qwen3Guard dropping from 91.0% to 33.8% (a 57.2 percentage point gap). In contrast, Granite-Guardian-3.2-5B showed the best generalization with only a 6.5 percentage point gap. A "helpful mode" jailbreak was also discovered where two guardrail models (Nemotron-Safety-8B, Granite-Guardian-3.2-5B) generated harmful content instead of blocking it, representing a novel failure mode. These findings suggest that benchmark performance may be misleading due to training data contamination, and that generalization ability, not overall accuracy, should be the primary metric for guardrail evaluation.
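For context on how the headline figures are derived: per-model accuracy is the fraction of the 1,445 prompts classified correctly, the 95% CI follows from a standard binomial approximation, and the generalization gap is the percentage-point drop from public benchmark prompts to novel prompts. The sketch below reproduces numbers consistent with the abstract; the function names and the exact correct-prompt count are illustrative assumptions, not taken from the paper.

```python
import math


def accuracy_with_ci(correct: int, total: int, z: float = 1.96):
    """Accuracy plus a normal-approximation 95% confidence interval."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def generalization_gap_pp(acc_public: float, acc_novel: float) -> float:
    """Percentage-point drop from public benchmark prompts to novel prompts."""
    return (acc_public - acc_novel) * 100


# Illustrative correct-prompt count, chosen to match the reported 85.3% accuracy.
p, lo, hi = accuracy_with_ci(correct=1232, total=1445)
print(f"overall accuracy: {p:.1%} (95% CI: {lo:.1%}-{hi:.1%})")

# Qwen3Guard: 91.0% on public prompts vs 33.8% on novel prompts.
print(f"generalization gap: {generalization_gap_pp(0.910, 0.338):.1f} pp")
```

Under these assumptions, the printed interval matches the reported 83.4-87.1% CI, and the Qwen3Guard gap comes out to the stated 57.2 percentage points.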
Related papers
- Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away [97.11976870616273]
We propose a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than an objective. In our evaluations across six open-source MLRMs and four jailbreak benchmarks, SafeThink reduces attack success rates by 30-60%.
arXiv Detail & Related papers (2026-02-11T18:09:17Z)
- What Matters For Safety Alignment? [38.86339753409445]
This paper presents a comprehensive empirical study on the safety alignment capabilities of AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. We identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models.
arXiv Detail & Related papers (2026-01-07T12:31:52Z)
- Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks [0.0]
Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. We conduct the first systematic testing and comparative evaluation of agentic AI systems. We identify six distinct defensive behavior patterns, including a novel "hallucinated compliance" strategy.
arXiv Detail & Related papers (2025-12-16T19:22:50Z)
- Self-HarmLLM: Can Large Language Model Harm Itself? [10.208363125551555]
We propose the Self-HarmLLM scenario, which uses a Mitigated Harmful Query (MHQ) generated by the same model as a new input. We conducted experiments on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions.
arXiv Detail & Related papers (2025-10-31T02:23:54Z)
- Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models [0.0]
Large language models (LLMs) remain vulnerable to sophisticated prompt engineering attacks. We introduce Jailbreak Mimicry, a systematic methodology for training compact attacker models to automatically generate narrative-based jailbreak prompts. Our approach transforms adversarial prompt discovery from manual craftsmanship into a reproducible scientific process.
arXiv Detail & Related papers (2025-10-24T23:53:16Z)
- Safeguarding Efficacy in Large Language Models: Evaluating Resistance to Human-Written and Algorithmic Adversarial Prompts [0.0]
This paper presents a systematic security assessment of four prominent Large Language Models (LLMs) against adversarial attack vectors. We evaluate Phi-2, Llama-2-7B-Chat, GPT-3.5-Turbo, and GPT-4 across four distinct attack categories: human-written prompts, AutoDAN, Greedy Coordinate Gradient (GCG), and Tree-of-Attacks-with-pruning (TAP).
arXiv Detail & Related papers (2025-10-12T21:48:34Z)
- WebGuard: Building a Generalizable Guardrail for Web Agents [59.31116061613742]
WebGuard is the first dataset designed to support the assessment of web agent action risks. It contains 4,939 human-annotated actions from 193 websites across 22 diverse domains.
arXiv Detail & Related papers (2025-07-18T18:06:27Z)
- Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data. These findings highlight a reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z)
- Evaluating the Robustness of the "Ensemble Everything Everywhere" Defense [90.7494670101357]
"Ensemble everything everywhere" is a defense against adversarial examples. We show that this defense is not robust to adversarial attack. We then use standard adaptive attack techniques to reduce the defense's robust accuracy.
arXiv Detail & Related papers (2024-11-22T10:17:32Z)
- Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring [47.40698758003993]
We propose an improved transfer attack method that guides malicious prompt construction by locally training a mirror model of the target black-box model through benign data distillation. Our approach achieved a maximum attack success rate of 92%, or a balanced value of 80% with an average of 1.5 detectable jailbreak queries per sample against GPT-3.5 Turbo.
arXiv Detail & Related papers (2024-10-28T14:48:05Z)
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [54.10865585773691]
We introduce WildGuard -- an open, lightweight moderation tool for LLM safety. WildGuard achieves three goals: identifying malicious intent in user prompts, detecting safety risks of model responses, and determining model refusal rate.
arXiv Detail & Related papers (2024-06-26T16:58:20Z)
- Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases [32.2246459413988]
Red-teaming aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query.
We present a new perspective on safety research: red-teaming through Unalignment.
Unalignment tunes the model parameters to break model guardrails that are not deeply rooted in the model's behavior.
arXiv Detail & Related papers (2023-10-22T13:55:46Z)