Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution
- URL: http://arxiv.org/abs/2506.01055v1
- Date: Sun, 01 Jun 2025 15:48:06 GMT
- Title: Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution
- Authors: Meysam Alizadeh, Zeynab Samei, Daria Stetsenko, Fabrizio Gilardi,
- Abstract summary: This paper examines how prompt injection can cause tool-calling agents to leak personal data observed during task execution.<n>Using a fictitious banking agent, we develop data flow-based attacks and integrate them into AgentDojo, a recent benchmark for agentic security.
- Score: 7.2497315292753415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous benchmarks on prompt injection in large language models (LLMs) have primarily focused on generic tasks and attacks, offering limited insights into more complex threats like data exfiltration. This paper examines how prompt injection can cause tool-calling agents to leak personal data observed during task execution. Using a fictitious banking agent, we develop data flow-based attacks and integrate them into AgentDojo, a recent benchmark for agentic security. To enhance its scope, we also create a richer synthetic dataset of human-AI banking conversations. In 16 user tasks from AgentDojo, LLMs show a 15-50 percentage point drop in utility under attack, with average attack success rates (ASR) around 20 percent; some defenses reduce ASR to zero. Most LLMs, even when successfully tricked by the attack, avoid leaking highly sensitive data like passwords, likely due to safety alignments, but they remain vulnerable to disclosing other personal data. The likelihood of password leakage increases when a password is requested along with one or two additional personal details. In an extended evaluation across 48 tasks, the average ASR is around 15 percent, with no built-in AgentDojo defense fully preventing leakage. Tasks involving data extraction or authorization workflows, which closely resemble the structure of exfiltration attacks, exhibit the highest ASRs, highlighting the interaction between task type, agent performance, and defense efficacy.
Related papers
- Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition [101.86739402748995]
We run the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios.<n>We build the Agent Red Teaming benchmark and evaluate it across 19 state-of-the-art models.<n>Our findings highlight critical and persistent vulnerabilities in today's AI agents.
arXiv Detail & Related papers (2025-07-28T05:13:04Z) - AGENTFUZZER: Generic Black-Box Fuzzing for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentXploit, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentXploit on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z) - StruPhantom: Evolutionary Injection Attacks on Black-Box Tabular Agents Powered by Large Language Models [25.579489111240136]
We present a novel attack termed StruPhantom which specifically targets black-box LLM-powered tabular agents.<n>Our attack achieves over 50% higher success rates than baselines in enforcing the application's response to contain phishing links or malicious codes.
arXiv Detail & Related papers (2025-04-14T03:22:04Z) - Defeating Prompt Injections by Design [79.00910871948787]
CaMeL is a robust defense that creates a protective system layer around the Large Language Models (LLMs)<n>To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query.<n>We demonstrate effectiveness of CaMeL by solving $67%$ of tasks with provable security in AgentDojo [NeurIPS 2024], a recent agentic security benchmark.
arXiv Detail & Related papers (2025-03-24T15:54:10Z) - A Practical Memory Injection Attack against LLM Agents [49.01756339657071]
MINJA enables the injection of malicious records into the memory bank by only interacting with the agent via queries and output observations.<n> MINJA enables any user to influence agent memory, highlighting practical risks of LLM agents.
arXiv Detail & Related papers (2025-03-05T17:53:24Z) - MELON: Provable Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison [60.30753230776882]
LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions.<n>We present MELON, a novel IPI defense that detects attacks by re-executing the agent's trajectory with a masked user prompt modified through a masking function.
arXiv Detail & Related papers (2025-02-07T18:57:49Z) - The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents [6.829628038851487]
Large Language Model (LLM) agents are increasingly being deployed as conversational assistants capable of performing complex real-world tasks through tool integration.<n>In particular, indirect prompt injection attacks pose a critical threat, where malicious instructions embedded within external data sources can manipulate agents to deviate from user intentions.<n>We propose a novel perspective that reframes agent security from preventing harmful actions to ensuring task alignment, requiring every agent action to serve user objectives.
arXiv Detail & Related papers (2024-12-21T16:17:48Z) - AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents [27.701301913159067]
We introduce AgentDojo, an evaluation framework for agents that execute tools over untrusted data.
AgentDojo is not a static test suite, but rather an environment for designing and evaluating new agent tasks, defenses, and adaptive attacks.
We populate AgentDojo with 97 realistic tasks, 629 security test cases, and various attack and defense paradigms from the literature.
arXiv Detail & Related papers (2024-06-19T08:55:56Z) - Air Gap: Protecting Privacy-Conscious Conversational Agents [44.04662124191715]
We introduce a novel threat model where adversarial third-party apps manipulate the context of interaction to trick LLM-based agents into revealing private information not relevant to the task at hand.
We introduce AirGapAgent, a privacy-conscious agent designed to prevent unintended data leakage by restricting the agent's access to only the data necessary for a specific task.
arXiv Detail & Related papers (2024-05-08T16:12:45Z) - InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents [3.5248694676821484]
We introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks.
InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools.
We show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time.
arXiv Detail & Related papers (2024-03-05T06:21:45Z) - Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [86.66627242073724]
This paper presents a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection.
To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs.
We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking.
arXiv Detail & Related papers (2023-11-02T06:13:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.