Related papers: STACK: Adversarial Attacks on LLM Safeguard Pipelines

URL: http://arxiv.org/abs/2506.24068v2
Date: Fri, 18 Jul 2025 01:26:48 GMT
Title: STACK: Adversarial Attacks on LLM Safeguard Pipelines
Authors: Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, Robert Kirk, Adam Gleave
Abstract summary: Anthropic guards their latest Claude 4 Opus model using one such defense pipeline. Other frontier developers including Google DeepMind and OpenAI pledge to soon deploy similar defenses. We address this gap by developing and red-teaming an open-source defense pipeline.
Score: 5.784929232265091
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Frontier AI developers are relying on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic guards their latest Claude 4 Opus model using one such defense pipeline, and other frontier developers including Google DeepMind and OpenAI pledge to soon deploy similar defenses. However, the security of such pipelines is unclear, with limited prior work evaluating or attacking these pipelines. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. Finally, we also evaluate STACK in a transfer setting, achieving 33% ASR, providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.

Related papers

Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain [82.98626829232899]
Fine-tuning AI agents on data from their own interactions introduces a critical security vulnerability within the AI supply chain. We show that adversaries can easily poison the data collection pipeline to embed hard-to-detect backdoors.
arXiv Detail & Related papers (2025-10-03T12:47:21Z)
Cuckoo Attack: Stealthy and Persistent Attacks Against AI-IDE [64.47951172662745]
Cuckoo Attack is a novel attack that achieves stealthy and persistent command execution by embedding malicious payloads into configuration files. We formalize our attack paradigm into two stages, including initial infection and persistence. We contribute seven actionable checkpoints for vendors to evaluate their product security.
arXiv Detail & Related papers (2025-09-19T04:10:52Z)
Benchmarking Misuse Mitigation Against Covert Adversaries [80.74502950627736]
Existing language model safety evaluations focus on overt attacks and low-stakes tasks. We develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
arXiv Detail & Related papers (2025-06-06T17:33:33Z)
Hoist with His Own Petard: Inducing Guardrails to Facilitate Denial-of-Service Attacks on Retrieval-Augmented Generation of LLMs [8.09404178079053]
Retrieval-Augmented Generation (RAG) integrates Large Language Models (LLMs) with external knowledge bases, improving output quality while introducing new security risks. Existing studies on RAG vulnerabilities typically focus on exploiting the retrieval mechanism to inject erroneous knowledge or malicious texts, inducing incorrect outputs. In this paper, we uncover a novel vulnerability: the safety guardrails of LLMs, while designed for protection, can also be exploited as an attack vector by adversaries. Building on this vulnerability, we propose MutedRAG, a novel denial-of-service attack that reversely leverages the guardrails to undermine the availability of
arXiv Detail & Related papers (2025-04-30T14:18:11Z)
No, of course I can! Refusal Mechanisms Can Be Exploited Using Harmless Fine-Tuning Data [22.667573777927203]
We present a new fine-tuning attack that trains models to first refuse harmful requests before answering them. This "refuse-then-comply" strategy bypasses shallow defenses and produces harmful responses that evade output filters. Our attack received a $2000 bug bounty from OpenAI and was acknowledged as a vulnerability by Anthropic.
arXiv Detail & Related papers (2025-02-26T20:20:01Z)
Stealthy and Robust Backdoor Attack against 3D Point Clouds through Additional Point Features [7.066252856912398]
3D backdoor attacks have posed a substantial threat to 3D Deep Neural Networks (3D DNNs) designed for 3D point clouds. This paper introduces the Stealthy and Robust Backdoor Attack (SRBA), which ensures robustness and stealthiness through intentional design considerations.
arXiv Detail & Related papers (2024-12-10T13:48:11Z)
The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
The vulnerability of Vision Large Language Models (VLLMs) to jailbreak attacks comes as no surprise. Recent defense mechanisms against these attacks have reached near-saturation performance on benchmark evaluations.
arXiv Detail & Related papers (2024-11-13T07:57:19Z)
Backdoor Attack with Sparse and Invisible Trigger [57.41876708712008]
Deep neural networks (DNNs) are vulnerable to backdoor attacks. The backdoor attack is an emerging yet serious training-phase threat. We propose a sparse and invisible backdoor attack (SIBA).
arXiv Detail & Related papers (2023-05-11T10:05:57Z)
Evaluating Gradient Inversion Attacks and Defenses in Federated Learning [43.993693910541275]
This paper evaluates existing attacks and defenses against gradient inversion attacks. We show the trade-offs of privacy leakage and data utility of three proposed defense mechanisms. Our findings suggest that the state-of-the-art attacks can currently be defended against with minor data utility loss.
arXiv Detail & Related papers (2021-11-30T19:34:16Z)
Certifiers Make Neural Networks Vulnerable to Availability Attacks [70.69104148250614]
We show for the first time that fallback strategies can be deliberately triggered by an adversary. In addition to naturally occurring abstains for some inputs and perturbations, the adversary can use training-time attacks to deliberately trigger the fallback. We design two novel availability attacks, which show the practical relevance of these threats.
arXiv Detail & Related papers (2021-08-25T15:49:10Z)
Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning [95.60856995067083]
This work is among the first to perform adversarial defense for ASV without knowing the specific attack algorithms. We propose to perform adversarial defense from two perspectives: 1) adversarial perturbation purification and 2) adversarial perturbation detection. Experimental results show that our detection module effectively shields the ASV by detecting adversarial samples with an accuracy of around 80%.
arXiv Detail & Related papers (2021-06-01T07:10:54Z)
A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN) based vision systems. This paper proposes a self-supervised adversarial training mechanism in the input space. It provides significant robustness against unseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)
Mitigating Advanced Adversarial Attacks with More Advanced Gradient Obfuscation Techniques [13.972753012322126]
Deep Neural Networks (DNNs) are well known to be vulnerable to Adversarial Examples (AEs). Recently, advanced gradient-based attack techniques were proposed. In this paper, we make a steady step towards mitigating those advanced gradient-based attacks.
arXiv Detail & Related papers (2020-05-27T23:42:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.