Related papers: Quantifying Conversation Drift in MCP via Latent Polytope

Quantifying Conversation Drift in MCP via Latent Polytope

URL: http://arxiv.org/abs/2508.06418v1
Date: Fri, 08 Aug 2025 16:05:27 GMT
Title: Quantifying Conversation Drift in MCP via Latent Polytope
Authors: Haoran Shi, Hongwei Yao, Shuo Shao, Shaopeng Jiao, Ziqi Peng, Zhan Qin, Cong Wang,
Abstract summary: The Model Context Protocol (MCP) enhances large language models (LLMs) by integrating external tools.<n> adversarially crafted content can induce tool poisoning or indirect prompt injection, leading to conversation hijacking, misinformation propagation, or data exfiltration.<n>We propose SecMCP, a framework that detects and quantifies conversation drift, deviations in latent space trajectories induced by adversarial external knowledge.
Score: 12.004235167472238
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Model Context Protocol (MCP) enhances large language models (LLMs) by integrating external tools, enabling dynamic aggregation of real-time data to improve task execution. However, its non-isolated execution context introduces critical security and privacy risks. In particular, adversarially crafted content can induce tool poisoning or indirect prompt injection, leading to conversation hijacking, misinformation propagation, or data exfiltration. Existing defenses, such as rule-based filters or LLM-driven detection, remain inadequate due to their reliance on static signatures, computational inefficiency, and inability to quantify conversational hijacking. To address these limitations, we propose SecMCP, a secure framework that detects and quantifies conversation drift, deviations in latent space trajectories induced by adversarial external knowledge. By modeling LLM activation vectors within a latent polytope space, SecMCP identifies anomalous shifts in conversational dynamics, enabling proactive detection of hijacking, misleading, and data exfiltration. We evaluate SecMCP on three state-of-the-art LLMs (Llama3, Vicuna, Mistral) across benchmark datasets (MS MARCO, HotpotQA, FinQA), demonstrating robust detection with AUROC scores exceeding 0.915 while maintaining system usability. Our contributions include a systematic categorization of MCP security threats, a novel latent polytope-based methodology for quantifying conversation drift, and empirical validation of SecMCP's efficacy.

Related papers

NeuroFilter: Privacy Guardrails for Conversational LLM Agents [50.75206727081996]
This work addresses the computational challenge of enforcing privacy for agentic Large Language Models (LLMs)<n>NeuroFilter is a guardrail framework that operationalizes contextual integrity by mapping norm violations to simple directions in the model's activation space.<n>A comprehensive evaluation across over 150,000 interactions, covering models from 7B to 70B parameters, illustrates the strong performance of NeuroFilter.
arXiv Detail & Related papers (2026-01-21T05:16:50Z)
Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs [2.2448294058653455]
adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards.<n>We propose Zero-Shot Embedding Drift Detection (ZEDD), a lightweight, low-engineering-overhead framework that identifies both direct and indirect prompt injection attempts.<n>ZEDD operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining.
arXiv Detail & Related papers (2026-01-18T11:33:35Z)
STaR: Sensitive Trajectory Regulation for Unlearning in Large Reasoning Models [12.133996629992318]
We propose a parameter-free, inference-time unlearning framework that achieves robust privacy protection throughout the reasoning process.<n>Experiments on the R-TOFU benchmark demonstrate that STaR achieves comprehensive and stable unlearning with minimal utility loss.
arXiv Detail & Related papers (2026-01-14T08:35:23Z)
A high-capacity linguistic steganography based on entropy-driven rank-token mapping [81.29800498695899]
Linguistic steganography enables covert communication through embedding secret messages into innocuous texts.<n>Traditional modification-based methods introduce detectable anomalies, while retrieval-based strategies suffer from low embedding capacity.<n>We propose an entropy-driven framework called RTMStega that integrates rank-based adaptive coding and context-aware decompression with normalized entropy.
arXiv Detail & Related papers (2025-10-27T06:02:47Z)
DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage.<n>We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps.<n>SLIM is the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z)
MCP-Guard: A Defense Framework for Model Context Protocol Integrity in Large Language Model Applications [21.70488724213541]
integration of Large Language Models with external tools introduces critical security vulnerabilities.<n>We propose MCP-Guard, a robust, layered defense architecture designed for LLM--tool interactions.<n>We also introduce MCP-AttackBench, a benchmark of over 70,000 samples.
arXiv Detail & Related papers (2025-08-14T18:00:25Z)
Backdoor Cleaning without External Guidance in MLLM Fine-tuning [76.82121084745785]
Believe Your Eyes (BYE) is a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples.<n>It achieves near-zero attack success rates while maintaining clean-task performance.
arXiv Detail & Related papers (2025-05-22T17:11:58Z)
Exploring LLM Reasoning Through Controlled Prompt Variations [0.9217021281095907]
We evaluate how well state-of-the-art models maintain logical consistency and correctness when confronted with four categories of prompt perturbations.<n>Our experiments, conducted on thirteen open-source and closed-source LLMs, reveal that introducing irrelevant context within the model's context window significantly degrades performance.<n>Certain perturbations inadvertently trigger chain-of-thought-like reasoning behaviors, even without explicit prompting.
arXiv Detail & Related papers (2025-04-02T20:18:50Z)
Theoretical Insights in Model Inversion Robustness and Conditional Entropy Maximization for Collaborative Inference Systems [89.35169042718739]
collaborative inference enables end users to leverage powerful deep learning models without exposure of sensitive raw data to cloud servers.<n>Recent studies have revealed that these intermediate features may not sufficiently preserve privacy, as information can be leaked and raw data can be reconstructed via model inversion attacks (MIAs)<n>This work first theoretically proves that the conditional entropy of inputs given intermediate features provides a guaranteed lower bound on the reconstruction mean square error (MSE) under any MIA.<n>Then, we derive a differentiable and solvable measure for bounding this conditional entropy based on the Gaussian mixture estimation and propose a conditional entropy algorithm to enhance the inversion robustness
arXiv Detail & Related papers (2025-03-01T07:15:21Z)
KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models. It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation. Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
Identifying the Hazard Boundary of ML-enabled Autonomous Systems Using Cooperative Co-Evolutionary Search [9.511076358998073]
It is essential to identify the hazard boundary of ML Components (MLCs) in the Machine Learning-enabled autonomous systems under analysis. We propose MLCSHE, a novel method based on a Cooperative Co-Evolutionary Algorithm (CCEA) We evaluate the effectiveness and efficiency of MLCSHE on a complex Autonomous Vehicle (AV) case study.
arXiv Detail & Related papers (2023-01-31T17:50:52Z)
Statistical control for spatio-temporal MEG/EEG source imaging with desparsified multi-task Lasso [102.84915019938413]
Non-invasive techniques like magnetoencephalography (MEG) or electroencephalography (EEG) offer promise of non-invasive techniques. The problem of source localization, or source imaging, poses however a high-dimensional statistical inference challenge. We propose an ensemble of desparsified multi-task Lasso (ecd-MTLasso) to deal with this problem.
arXiv Detail & Related papers (2020-09-29T21:17:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.