Related papers: The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

URL: http://arxiv.org/abs/2510.09023v1
Date: Fri, 10 Oct 2025 05:51:04 GMT
Title: The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections
Authors: Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, Florian Tramèr,
Abstract summary: Current defenses against jailbreaks and prompt injections are typically evaluated against a static set of harmful attack strings.<n>We argue that this evaluation process is flawed. Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design.
Score: 74.60337113759313
License: http://creativecommons.org/licenses/by/4.0/
Abstract: How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed. Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques-gradient descent, reinforcement learning, random search, and human-guided exploration-we bypass 12 recent defenses (based on a diverse set of techniques) with attack success rate above 90% for most; importantly, the majority of defenses originally reported near-zero attack success rates. We believe that future defense work must consider stronger attacks, such as the ones we describe, in order to make reliable and convincing claims of robustness.

Related papers

Benchmarking Misuse Mitigation Against Covert Adversaries [80.74502950627736]
Existing language model safety evaluations focus on overt attacks and low-stakes tasks.<n>We develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses.<n>Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
arXiv Detail & Related papers (2025-06-06T17:33:33Z)
A Critical Evaluation of Defenses against Prompt Injection Attacks [95.81023801370073]
Large Language Models (LLMs) are vulnerable to prompt injection attacks.<n>Several defenses have recently been proposed, often claiming to mitigate these attacks successfully.<n>We argue that existing studies lack a principled approach to evaluating these defenses.
arXiv Detail & Related papers (2025-05-23T19:39:56Z)
Counter-Samples: A Stateless Strategy to Neutralize Black Box Adversarial Attacks [2.9815109163161204]
Our paper presents a novel defence against black box attacks, where attackers use the victim model as an oracle to craft their adversarial examples. Unlike traditional preprocessing defences that rely on sanitizing input samples, our strategy counters the attack process itself. We demonstrate that our approach is remarkably effective against state-of-the-art black box attacks and outperforms existing defences for both the CIFAR-10 and ImageNet datasets.
arXiv Detail & Related papers (2024-03-14T10:59:54Z)
Hindering Adversarial Attacks with Multiple Encrypted Patch Embeddings [13.604830818397629]
We propose a new key-based defense focusing on both efficiency and robustness. We build upon the previous defense with two major improvements: (1) efficient training and (2) optional randomization. Experiments were carried out on the ImageNet dataset, and the proposed defense was evaluated against an arsenal of state-of-the-art attacks.
arXiv Detail & Related papers (2023-09-04T14:08:34Z)
Randomness in ML Defenses Helps Persistent Attackers and Hinders Evaluators [49.52538232104449]
It is becoming increasingly imperative to design robust ML defenses. Recent work has found that many defenses that initially resist state-of-the-art attacks can be broken by an adaptive adversary. We take steps to simplify the design of defenses and argue that white-box defenses should eschew randomness when possible.
arXiv Detail & Related papers (2023-02-27T01:33:31Z)
A Game-Theoretic Approach for AI-based Botnet Attack Defence [5.020067709306813]
New generation of botnets leverage Artificial Intelligent (AI) techniques to conceal the identity of botmasters and the attack intention to avoid detection. There has not been an existing assessment tool capable of evaluating the effectiveness of existing defense strategies against this kind of AI-based botnet attack. We propose a sequential game theory model that is capable to analyse the details of the potential strategies botnet attackers and defenders could use to reach Nash Equilibrium (NE)
arXiv Detail & Related papers (2021-12-04T02:53:40Z)
TROJANZOO: Everything you ever wanted to know about neural backdoors (but were afraid to ask) [28.785693760449604]
TROJANZOO is the first open-source platform for evaluating neural backdoor attacks/defenses. It has 12 representative attacks, 15 state-of-the-art defenses, 6 attack performance metrics, 10 defense utility metrics, as well as rich tools for analysis of attack-defense interactions. We conduct a systematic study of existing attacks/defenses, leading to a number of interesting findings.
arXiv Detail & Related papers (2020-12-16T22:37:27Z)
Guided Adversarial Attack for Evaluating and Enhancing Adversarial Defenses [59.58128343334556]
We introduce a relaxation term to the standard loss, that finds more suitable gradient-directions, increases attack efficacy and leads to more efficient adversarial training. We propose Guided Adversarial Margin Attack (GAMA), which utilizes function mapping of the clean image to guide the generation of adversaries. We also propose Guided Adversarial Training (GAT), which achieves state-of-the-art performance amongst single-step defenses.
arXiv Detail & Related papers (2020-11-30T16:39:39Z)
Deflecting Adversarial Attacks [94.85315681223702]
We present a new approach towards ending this cycle where we "deflect" adversarial attacks by causing the attacker to produce an input that resembles the attack's target class. We first propose a stronger defense based on Capsule Networks that combines three detection mechanisms to achieve state-of-the-art detection performance.
arXiv Detail & Related papers (2020-02-18T06:59:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.