Related papers: Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents

Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents

URL: http://arxiv.org/abs/2503.00061v2
Date: Tue, 04 Mar 2025 03:32:46 GMT
Title: Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents
Authors: Qiusi Zhan, Richard Fang, Henil Shalin Panchal, Daniel Kang,
Abstract summary: We evaluate eight different defenses and bypass all of them using adaptive attacks, consistently achieving an attack success rate of over 50%.<n>Our research underscores the need for adaptive attack evaluation when designing defenses to ensure robustness and reliability.
Score: 3.5248694676821484
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Model (LLM) agents exhibit remarkable performance across diverse applications by using external tools to interact with environments. However, integrating external tools introduces security risks, such as indirect prompt injection (IPI) attacks. Despite defenses designed for IPI attacks, their robustness remains questionable due to insufficient testing against adaptive attacks. In this paper, we evaluate eight different defenses and bypass all of them using adaptive attacks, consistently achieving an attack success rate of over 50%. This reveals critical vulnerabilities in current defenses. Our research underscores the need for adaptive attack evaluation when designing defenses to ensure robustness and reliability. The code is available at https://github.com/uiuc-kang-lab/AdaptiveAttackAgent.

Related papers

Benchmarking Misuse Mitigation Against Covert Adversaries [80.74502950627736]
Existing language model safety evaluations focus on overt attacks and low-stakes tasks.<n>We develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses.<n>Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
arXiv Detail & Related papers (2025-06-06T17:33:33Z)
AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage. We show that scaling agentic reasoning system at test-time substantially enhances robustness without compromising model utility. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z)
ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models [55.93380086403591]
Generative large language models are vulnerable to backdoor attacks.<n>$textitELBA-Bench$ allows attackers to inject backdoor through parameter efficient fine-tuning.<n>$textitELBA-Bench$ provides over 1300 experiments.
arXiv Detail & Related papers (2025-02-22T12:55:28Z)
SPIN: Self-Supervised Prompt INjection [16.253558670549697]
adversarial and jailbreak attacks have been proposed to bypass the safety alignment and cause the model to produce harmful responses. We introduce Self-supervised Prompt INjection (SPIN) which can detect and reverse these various attacks on LLMs. Our system can reduce the attack success rate by up to 87.9%, while maintaining the performance on benign user requests.
arXiv Detail & Related papers (2024-10-17T05:40:54Z)
CARE: Ensemble Adversarial Robustness Evaluation Against Adaptive Attackers for Security Applications [14.25922051336361]
Ensemble defenses are widely employed in various security-related applications to enhance model performance and robustness. There are no platforms for comprehensive evaluation of ensemble adversarial attacks and defenses in the cybersecurity domain.
arXiv Detail & Related papers (2024-01-20T05:37:09Z)
Adversarial Markov Games: On Adaptive Decision-Based Attacks and Defenses [21.759075171536388]
We show how attacks but also defenses can benefit by it and by learning from each other through interaction. We demonstrate that active defenses, which control how the system responds, are a necessary complement to model hardening when facing decision-based attacks. We lay out effective strategies in ensuring the robustness of ML-based systems deployed in the real-world.
arXiv Detail & Related papers (2023-12-20T21:24:52Z)
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment [31.24530091590395]
We study an attack scenario called Trojan Activation Attack (TA2), which injects trojan steering vectors into the activation layers of Large Language Models. Our experiment results show that TA2 is highly effective and adds little or no overhead to attack efficiency.
arXiv Detail & Related papers (2023-11-15T23:07:40Z)
Avoid Adversarial Adaption in Federated Learning by Multi-Metric Investigations [55.2480439325792]
Federated Learning (FL) facilitates decentralized machine learning model training, preserving data privacy, lowering communication costs, and boosting model performance through diversified data sources. FL faces vulnerabilities such as poisoning attacks, undermining model integrity with both untargeted performance degradation and targeted backdoor attacks. We define a new notion of strong adaptive adversaries, capable of adapting to multiple objectives simultaneously. MESAS is the first defense robust against strong adaptive adversaries, effective in real-world data scenarios, with an average overhead of just 24.37 seconds.
arXiv Detail & Related papers (2023-06-06T11:44:42Z)
Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks [76.35478518372692]
We introduce epsilon-illusory, a novel form of adversarial attack on sequential decision-makers. Compared to existing attacks, we empirically find epsilon-illusory to be significantly harder to detect with automated methods. Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses.
arXiv Detail & Related papers (2022-07-20T19:49:09Z)
Adversarial defense for automatic speaker verification by cascaded self-supervised learning models [101.42920161993455]
More and more malicious attackers attempt to launch adversarial attacks at automatic speaker verification (ASV) systems. We propose a standard and attack-agnostic method based on cascaded self-supervised learning models to purify the adversarial perturbations. Experimental results demonstrate that the proposed method achieves effective defense performance and can successfully counter adversarial attacks.
arXiv Detail & Related papers (2021-02-14T01:56:43Z)
On Adaptive Attacks to Adversarial Example Defenses [123.32678153377915]
This paper lays out the methodology and the approach necessary to perform an adaptive attack against defenses to adversarial examples. We hope that these analyses will serve as guidance on how to properly perform adaptive attacks against defenses to adversarial examples.
arXiv Detail & Related papers (2020-02-19T18:50:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.