MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents
- URL: http://arxiv.org/abs/2602.14281v3
- Date: Tue, 24 Feb 2026 02:08:48 GMT
- Title: MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents
- Authors: Zhenhong Zhou, Yuanhe Zhang, Hongwei Cai, Moayad Aloqaily, Ouns Bouachir, Linsey Pang, Prakhar Mehrotra, Kun Wang, Qingsong Wen,
- Abstract summary: We propose MCPShield as a plug-in security cognition layer that ensures agent security when invoking MCP-based tools.<n>Our work provides a practical and robust security safeguard for MCP-based tool invocation in open agent ecosystems.
- Score: 39.267334469481916
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Model Context Protocol (MCP) standardizes tool use for LLM-based agents and enable third-party servers. This openness introduces a security misalignment: agents implicitly trust tools exposed by potentially untrusted MCP servers. However, despite its excellent utility, existing agents typically offer limited validation for third-party MCP servers. As a result, agents remain vulnerable to MCP-based attacks that exploit the misalignment between agents and servers throughout the tool invocation lifecycle. In this paper, we propose MCPShield as a plug-in security cognition layer that mitigates this misalignment and ensures agent security when invoking MCP-based tools. Drawing inspiration from human experience-driven tool validation, MCPShield assists agent forms security cognition with metadata-guided probing before invocation. Our method constrains execution within controlled boundaries while cognizing runtime events, and subsequently updates security cognition by reasoning over historical traces after invocation, building on human post-use reflection on tool behavior. Experiments demonstrate that MCPShield exhibits strong generalization in defending against six novel MCP-based attack scenarios across six widely used agentic LLMs, while avoiding false positives on benign servers and incurring low deployment overhead. Overall, our work provides a practical and robust security safeguard for MCP-based tool invocation in open agent ecosystems.
Related papers
- ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback [53.2744585868162]
Monitoring step-level tool invocation behaviors in real time is critical for agent deployment.<n>We first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents.<n>We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning.
arXiv Detail & Related papers (2026-01-15T07:54:32Z) - Towards Verifiably Safe Tool Use for LLM Agents [53.55621104327779]
Large language model (LLM)-based AI agents extend capabilities by enabling access to tools such as data sources, APIs, search engines, code sandboxes, and even other agents.<n>LLMs may invoke unintended tool interactions and introduce risks, such as leaking sensitive data or overwriting critical records.<n>Current approaches to mitigate these risks, such as model-based safeguards, enhance agents' reliability but cannot guarantee system safety.
arXiv Detail & Related papers (2026-01-12T21:31:38Z) - MCP-ITP: An Automated Framework for Implicit Tool Poisoning in MCP [22.063867518456743]
In implicit tool poisoning, malicious instructions embedded in tool metadata are injected into the agent context during the Model Context Protocol (MCP) registration phase.<n>We propose MCP-ITP, the first automated and adaptive framework for implicit tool poisoning within the MCP ecosystem.
arXiv Detail & Related papers (2026-01-12T10:28:46Z) - Automatic Red Teaming LLM-based Agents with Model Context Protocol Tools [47.32559576064343]
We propose AutoMalTool, an automated red teaming framework for LLM-based agents by generating malicious MCP tools.<n>Our evaluation shows that AutoMalTool effectively generates malicious MCP tools capable of manipulating the behavior of mainstream LLM-based agents.
arXiv Detail & Related papers (2025-09-25T11:14:38Z) - Mind Your Server: A Systematic Study of Parasitic Toolchain Attacks on the MCP Ecosystem [13.95558554298296]
Large language models (LLMs) are increasingly integrated with external systems through the Model Context Protocol (MCP)<n>In this paper, we reveal a new class of attacks, Parasitic Toolchain Attacks, instantiated as MCP Unintended Privacy Disclosure (MCP-UPD)<n>The malicious logic infiltrates the toolchain and unfolds in three phases: Parasitic Ingestion, Privacy Collection, and Privacy Disclosure, culminating in stealthy exfiltration of private data.
arXiv Detail & Related papers (2025-09-08T11:35:32Z) - MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols [7.10162765778832]
We present the first systematic taxonomy of MCP security, identifying 17 attack types across 4 primary attack surfaces.<n>We introduce MCPSecBench, a comprehensive security benchmark and playground that integrates prompt datasets, MCP servers, MCP clients, attack scripts, and protection mechanisms.
arXiv Detail & Related papers (2025-08-17T11:49:16Z) - BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks [58.959622170433725]
BlindGuard is an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors.<n>We show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across multi-agent systems.
arXiv Detail & Related papers (2025-08-11T16:04:47Z) - OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories.<n>Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms.<n>It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
arXiv Detail & Related papers (2025-07-08T16:18:54Z) - Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol (MCP) Ecosystem [16.610726885457847]
The Model Context Protocol (MCP) is an emerging standard designed to enable seamless interaction between Large Language Model (LLM) applications and external tools or resources.<n>This paper presents the first end-to-end empirical evaluation of attack vectors targeting the MCP ecosystem.
arXiv Detail & Related papers (2025-05-31T08:01:11Z) - MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits [0.0]
The Model Context Protocol (MCP) is an open protocol that standardizes API calls to large language models (LLMs), data sources, and agentic tools.<n>We show that the current MCP design carries a wide range of security risks for end users.<n>We introduce a safety auditing tool, MCPSafetyScanner, to assess the security of an arbitrary MCP server.
arXiv Detail & Related papers (2025-04-02T21:46:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.