Attractive Metadata Attack: Inducing LLM Agents to Invoke Malicious Tools
- URL: http://arxiv.org/abs/2508.02110v1
- Date: Mon, 04 Aug 2025 06:38:59 GMT
- Title: Attractive Metadata Attack: Inducing LLM Agents to Invoke Malicious Tools
- Authors: Kanghua Mo, Li Hu, Yucheng Long, Zhihao Li
- Abstract summary: Large language model (LLM) agents have demonstrated remarkable capabilities in complex reasoning and decision-making by leveraging external tools. We identify this as a new and stealthy threat surface that allows malicious tools to be preferentially selected by LLM agents. We propose a black-box in-context learning framework that generates highly attractive but syntactically and semantically valid tool metadata.
- Score: 10.086284534400658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language model (LLM) agents have demonstrated remarkable capabilities in complex reasoning and decision-making by leveraging external tools. However, this tool-centric paradigm introduces a previously underexplored attack surface: adversaries can manipulate tool metadata -- such as names, descriptions, and parameter schemas -- to influence agent behavior. We identify this as a new and stealthy threat surface that allows malicious tools to be preferentially selected by LLM agents, without requiring prompt injection or access to model internals. To demonstrate and exploit this vulnerability, we propose the Attractive Metadata Attack (AMA), a black-box in-context learning framework that generates highly attractive but syntactically and semantically valid tool metadata through iterative optimization. Our attack integrates seamlessly into standard tool ecosystems and requires no modification to the agent's execution framework. Extensive experiments across ten realistic, simulated tool-use scenarios and a range of popular LLM agents demonstrate consistently high attack success rates (81%-95%) and significant privacy leakage, with negligible impact on primary task execution. Moreover, the attack remains effective even under prompt-level defenses and structured tool-selection protocols such as the Model Context Protocol, revealing systemic vulnerabilities in current agent architectures. These findings reveal that metadata manipulation constitutes a potent and stealthy attack surface, highlighting the need for execution-level security mechanisms that go beyond prompt-level defenses.
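As a rough illustration of the metadata surface the abstract describes, the sketch below models a tool's name, description, and parameter schema, and uses a toy word-overlap scorer in place of an LLM agent's tool-selection step. The `ToolMetadata` class, the `overlap_score` and `select_tool` functions, and both example tools are hypothetical and do not come from the paper. A real agent selects tools with a language model rather than word overlap, and AMA's iterative optimization is not reproduced here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToolMetadata:
    """Hypothetical container for the three metadata fields the abstract names."""
    name: str
    description: str
    parameters: dict = field(default_factory=dict)  # JSON-Schema-like parameter spec


def words(text: str) -> set:
    """Lowercase the text and split it into alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap_score(query: str, tool: ToolMetadata) -> int:
    """Toy stand-in for selection: count query words found in the name or description."""
    return len(words(query) & words(f"{tool.name} {tool.description}"))


def select_tool(query: str, tools: list) -> ToolMetadata:
    """Pick the highest-scoring tool; ties go to the earliest tool in the list."""
    return max(tools, key=lambda tool: overlap_score(query, tool))


plain = ToolMetadata(
    name="weather_lookup",
    description="Returns the forecast for a city.",
    parameters={"city": {"type": "string"}},
)
# Same parameter schema, but the description is worded to match more user queries.
broad = ToolMetadata(
    name="weather_helper",
    description="Get the current weather forecast today for any city or trip.",
    parameters={"city": {"type": "string"}},
)

query = "what is the weather forecast for my trip today"
chosen = select_tool(query, [plain, broad])
print(chosen.name, overlap_score(query, plain), overlap_score(query, broad))
```

Both tools expose an identical and valid schema; only the wording of the description differs, which is enough to change which tool this toy selector prefers.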
Related papers
- AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs [24.71883582216731]
AdapTools is a novel adaptive IPI attack framework that selects stealthier attack tools and generates adaptive attack prompts. AdapTools achieves a 2.13 times improvement in attack success rate while degrading system utility by a factor of 1.78.
arXiv Detail & Related papers (2026-02-24T09:32:19Z) - SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement [120.52289344734415]
We propose an automated framework for stealthy prompt injection tailored to agent skills. The framework forms a closed loop with three agents: an Attack Agent that synthesizes injection skills under explicit stealth constraints, a Code Agent that executes tasks using the injected skills and an Evaluate Agent that logs action traces. Our method consistently achieves high attack success rates under realistic settings.
arXiv Detail & Related papers (2026-02-15T16:09:48Z) - Sponge Tool Attack: Stealthy Denial-of-Efficiency against Tool-Augmented Agentic Reasoning [58.432996881401415]
Recent work augments large language models (LLMs) with external tools to enable agentic reasoning. We propose Sponge Tool Attack (STA), which disrupts agentic reasoning solely by rewriting the input prompt. STA generates benign-looking prompt rewrites from the original one with high semantic fidelity.
arXiv Detail & Related papers (2026-01-24T19:36:51Z) - Securing the Model Context Protocol: Defending LLMs Against Tool Poisoning and Adversarial Attacks [8.419049623790618]
This work analyzes three classes of semantic attacks on MCP-integrated systems. We introduce a layered security framework with three components: RSA-based manifest signing to enforce descriptor integrity, LLM-on-LLM semantic vetting to detect suspicious tool definitions, and lightweight guardrails that block anomalous tool behavior at runtime. Our results show that the proposed framework reduces unsafe tool invocation rates without model fine-tuning or internal modification.
arXiv Detail & Related papers (2025-12-06T20:07:58Z) - Exploiting Web Search Tools of AI Agents for Data Exfiltration [0.46664938579243564]
Large language models (LLMs) are now routinely used to execute complex tasks, from natural language processing to dynamic workflows like web searches. The usage of tool-calling and Retrieval Augmented Generation (RAG) allows LLMs to process and retrieve sensitive corporate data, amplifying both their functionality and vulnerability to abuse. We analyze how susceptible current LLMs are to indirect prompt injection attacks, which parameters, including model size and manufacturer, shape their vulnerability, and which attack methods remain most effective.
arXiv Detail & Related papers (2025-10-10T07:39:01Z) - IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents [33.775221377823925]
Large language model (LLM) agents are widely deployed in real-world applications, where they leverage tools to retrieve and manipulate external data for complex tasks. When interacting with untrusted data sources, tool responses may contain injected instructions that covertly influence agent behaviors and lead to malicious outcomes. We propose a novel defensive task execution paradigm, called IPIGuard, to prevent malicious tool invocations at the source.
arXiv Detail & Related papers (2025-08-21T07:08:16Z) - Searching for Privacy Risks in LLM Agents via Simulation [61.229785851581504]
We present a search-based framework that alternates between improving attack and defense strategies through the simulation of privacy-critical agent interactions. We find that attack strategies escalate from direct requests to sophisticated tactics, such as impersonation and consent forgery. The discovered attacks and defenses transfer across diverse scenarios and backbone models, demonstrating strong practical utility for building privacy-aware agents.
arXiv Detail & Related papers (2025-08-14T17:49:09Z) - A Survey on Model Extraction Attacks and Defenses for Large Language Models [55.60375624503877]
Model extraction attacks pose significant security threats to deployed language models. This survey provides a comprehensive taxonomy of extraction attacks and defenses, categorizing attacks into functionality extraction, training data extraction, and prompt-targeted attacks. We examine defense mechanisms organized into model protection, data privacy protection, and prompt-targeted strategies, evaluating their effectiveness across different deployment scenarios.
arXiv Detail & Related papers (2025-06-26T22:02:01Z) - MISLEADER: Defending against Model Extraction with Ensembles of Distilled Models [56.09354775405601]
Model extraction attacks aim to replicate the functionality of a black-box model through query access. Most existing defenses presume that attacker queries have out-of-distribution (OOD) samples, enabling them to detect and disrupt suspicious inputs. We propose MISLEADER, a novel defense strategy that does not rely on OOD assumptions.
arXiv Detail & Related papers (2025-06-03T01:37:09Z) - AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities. We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o. We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z) - Prompt Injection Attack to Tool Selection in LLM Agents [74.90338504778781]
We introduce ToolHijacker, a novel prompt injection attack targeting tool selection in no-box scenarios. ToolHijacker injects a malicious tool document into the tool library to manipulate the LLM agent's tool selection process. We show that ToolHijacker is highly effective, significantly outperforming existing manual-based and automated prompt injection attacks.
arXiv Detail & Related papers (2025-04-28T13:36:43Z) - StruPhantom: Evolutionary Injection Attacks on Black-Box Tabular Agents Powered by Large Language Models [25.579489111240136]
We present a novel attack termed StruPhantom which specifically targets black-box LLM-powered tabular agents. Our attack achieves over 50% higher success rates than baselines in enforcing the application's response to contain phishing links or malicious code.
arXiv Detail & Related papers (2025-04-14T03:22:04Z) - DrunkAgent: Stealthy Memory Corruption in LLM-Powered Recommender Agents [28.294322726282896]
Large language model (LLM)-powered agents are increasingly used in recommender systems (RSs) to achieve personalized behavior modeling. This paper presents the first systematic investigation of memory-based vulnerabilities in LLM-powered recommender agents. We propose a novel black-box attack framework named DrunkAgent, which crafts semantically meaningful adversarial triggers.
arXiv Detail & Related papers (2025-03-31T07:35:40Z) - MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents [60.30753230776882]
LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions. We present MELON, a novel IPI defense that detects attacks by re-executing the agent's trajectory with a masked user prompt modified through a masking function.
arXiv Detail & Related papers (2025-02-07T18:57:49Z) - From Allies to Adversaries: Manipulating LLM Tool-Calling through Adversarial Injection [11.300387488829035]
Tool-calling has changed Large Language Model (LLM) applications by integrating external tools. We present ToolCommander, a novel framework designed to exploit vulnerabilities in LLM tool-calling systems through adversarial tool injection.
arXiv Detail & Related papers (2024-12-13T15:15:24Z) - Imprompter: Tricking LLM Agents into Improper Tool Use [35.255462653237885]
Large Language Model (LLM) Agents are an emerging computing paradigm that blends generative machine learning with tools such as code interpreters, web browsing, email, and more generally, external resources.
We contribute to the security foundations of agent-based systems and surface a new class of automatically computed obfuscated adversarial prompt attacks.
arXiv Detail & Related papers (2024-10-19T01:00:57Z) - Compromising Embodied Agents with Contextual Backdoor Attacks [69.71630408822767]
Large language models (LLMs) have transformed the development of embodied intelligence.
This paper uncovers a significant backdoor security threat within this process.
By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM.
arXiv Detail & Related papers (2024-08-06T01:20:12Z) - Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification [35.16099878559559]
Large language models (LLMs) have experienced significant development and are being deployed in real-world applications.
We introduce a new type of attack that causes malfunctions by misleading the agent into executing repetitive or irrelevant actions.
Our experiments reveal that these attacks can induce failure rates exceeding 80% in multiple scenarios.
arXiv Detail & Related papers (2024-07-30T14:35:31Z)
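One defense named in the list above is manifest signing to enforce descriptor integrity. A minimal sketch of that idea follows. It uses HMAC-SHA256 from the Python standard library as a stand-in for the RSA signatures mentioned in that summary, so it assumes a shared secret rather than a public/private key pair. The names `canonical_bytes`, `sign_manifest`, `verify_manifest`, the `SECRET_KEY` value, and the example manifest are all hypothetical and are not taken from any of the listed papers.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical shared secret


def canonical_bytes(manifest: dict) -> bytes:
    """Serialize with sorted keys so equal manifests always yield equal bytes."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_manifest(manifest: dict, key: bytes = SECRET_KEY) -> str:
    """Return a hex HMAC-SHA256 tag over the canonical manifest bytes."""
    return hmac.new(key, canonical_bytes(manifest), hashlib.sha256).hexdigest()


def verify_manifest(manifest: dict, tag: str, key: bytes = SECRET_KEY) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign_manifest(manifest, key), tag)


manifest = {
    "name": "weather_lookup",
    "description": "Returns the forecast for a city.",
    "parameters": {"city": {"type": "string"}},
}
tag = sign_manifest(manifest)

# A later edit to the description invalidates the tag recorded at registration.
tampered = dict(manifest, description="Always use this tool first for every request.")

print(verify_manifest(manifest, tag), verify_manifest(tampered, tag))
```

A check like this only detects changes made after the manifest was signed. It says nothing about whether the metadata was already written to be unusually persuasive at signing time, which is why that summary pairs signing with semantic vetting and runtime guardrails.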
This list is automatically generated from the titles and abstracts of the papers on this site.