MELON: Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison
- URL: http://arxiv.org/abs/2502.05174v1
- Date: Fri, 07 Feb 2025 18:57:49 GMT
- Title: MELON: Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison
- Authors: Kaijie Zhu, Xianjun Yang, Jindong Wang, Wenbo Guo, William Yang Wang
- Abstract summary: LLM agents are vulnerable to indirect prompt injection (IPI) attacks. We present MELON, a novel IPI defense. We show that MELON outperforms SOTA defenses in both attack prevention and utility preservation.
- Score: 60.30753230776882
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research has shown that LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions. Existing defenses against IPI have significant limitations: they either require substantial model-training resources, lack effectiveness against sophisticated attacks, or harm normal utility. We present MELON (Masked re-Execution and TooL comparisON), a novel IPI defense. Our approach builds on the observation that under a successful attack, the agent's next action depends less on the user task and more on the malicious task. Based on this, we design MELON to detect attacks by re-executing the agent's trajectory with a masked user prompt modified through a masking function. We identify an attack if the actions generated in the original and masked executions are similar. We also include three key designs to reduce potential false positives and false negatives. Extensive evaluation on the IPI benchmark AgentDojo demonstrates that MELON outperforms SOTA defenses in both attack prevention and utility preservation. Moreover, we show that combining MELON with a SOTA prompt augmentation defense (denoted as MELON-Aug) further improves its performance. We also conduct a detailed ablation study to validate our key designs.
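The core check described in the abstract can be sketched in a few lines: run the agent once with the real user prompt and once with the user task masked out, then compare the resulting tool calls. This is a minimal, self-contained illustration, not the paper's implementation; the function names, the bag-of-characters embedding, and the similarity threshold are all assumptions chosen so the sketch runs without external dependencies.

```python
def embed(text):
    # Stand-in for a real text-embedding model: a trivial
    # bag-of-characters vector so the sketch is self-contained.
    vec = [0.0] * 128
    for ch in text:
        vec[ord(ch) % 128] += 1.0
    return vec

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def melon_flags_attack(original_action, masked_action, threshold=0.8):
    # MELON's intuition: if the agent picks a similar next action even
    # when the user task is masked out, that action is likely driven by
    # injected content in tool outputs rather than by the user task.
    return cosine(embed(original_action), embed(masked_action)) >= threshold
```

In a real deployment, `original_action` and `masked_action` would be serialized tool calls from two agent executions, and the embedding and threshold would come from the paper's three additional designs for reducing false positives and negatives.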
Related papers
- StruPhantom: Evolutionary Injection Attacks on Black-Box Tabular Agents Powered by Large Language Models [25.579489111240136]
We present a novel attack termed StruPhantom which specifically targets black-box LLM-powered tabular agents.
Our attack achieves over 50% higher success rates than baselines in forcing the application's response to contain phishing links or malicious code.
arXiv Detail & Related papers (2025-04-14T03:22:04Z)
- CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent [32.958798200220286]
Large Language Model (LLM)-empowered recommender systems (RecSys) have brought significant advances in personalized user experience.
We propose a novel attack framework called CheatAgent by harnessing the human-like capabilities of LLMs.
Our method first identifies the insertion position for maximum impact with minimal input modification.
arXiv Detail & Related papers (2025-04-13T05:31:37Z)
- A Practical Memory Injection Attack against LLM Agents [49.01756339657071]
MINJA enables the injection of malicious records into the memory bank by only interacting with the agent via queries and output observations.
MINJA enables any user to influence agent memory, highlighting practical risks of LLM agents.
arXiv Detail & Related papers (2025-03-05T17:53:24Z)
- Defense Against Prompt Injection Attack by Leveraging Attack Techniques [66.65466992544728]
Large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks. As LLMs continue to evolve, new vulnerabilities arise, especially prompt injection attacks. Recent attack methods leverage LLMs' instruction-following abilities and their inability to distinguish instructions injected into the data content.
arXiv Detail & Related papers (2024-11-01T09:14:21Z)
- Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks.
We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction.
We propose Attention Tracker, a training-free detection method that tracks attention patterns on the instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z)
- Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In [5.65782619470663]
We examine how ReAct agents can be exploited using a straightforward yet effective method we refer to as the foot-in-the-door attack.
Our experiments show that indirect prompt injection attacks can significantly increase the likelihood of the agent performing subsequent malicious actions.
To mitigate this vulnerability, we propose implementing a simple reflection mechanism that prompts the agent to reassess the safety of its actions during execution.
arXiv Detail & Related papers (2024-10-22T12:24:41Z)
- InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents [3.5248694676821484]
We introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks.
InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools.
We show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time.
arXiv Detail & Related papers (2024-03-05T06:21:45Z)
- Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment [31.24530091590395]
We study an attack scenario called Trojan Activation Attack (TA2), which injects trojan steering vectors into the activation layers of Large Language Models.
Our experiment results show that TA2 is highly effective and adds little or no overhead to attack efficiency.
arXiv Detail & Related papers (2023-11-15T23:07:40Z)
- Malicious Agent Detection for Robust Multi-Agent Collaborative Perception [52.261231738242266]
Multi-agent collaborative (MAC) perception is more vulnerable to adversarial attacks than single-agent perception.
We propose Malicious Agent Detection (MADE), a reactive defense specific to MAC perception.
We conduct comprehensive evaluations on a benchmark 3D dataset V2X-sim and a real-road dataset DAIR-V2X.
arXiv Detail & Related papers (2023-10-18T11:36:42Z)
- Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks [72.03945355787776]
We advocate MDP, a lightweight, pluggable, and effective defense for PLMs as few-shot learners.
We show analytically that MDP creates an interesting dilemma for the attacker to choose between attack effectiveness and detection evasiveness.
arXiv Detail & Related papers (2023-09-23T04:41:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.