Reasoning-Style Poisoning of LLM Agents via Stealthy Style Transfer: Process-Level Attacks and Runtime Monitoring in RSV Space
- URL: http://arxiv.org/abs/2512.14448v1
- Date: Tue, 16 Dec 2025 14:34:10 GMT
- Title: Reasoning-Style Poisoning of LLM Agents via Stealthy Style Transfer: Process-Level Attacks and Runtime Monitoring in RSV Space
- Authors: Xingfu Zhou, Pengfei Wang
- Abstract summary: Reasoning-Style Poisoning (RSP) manipulates how agents process information rather than what they process. Generative Style Injection (GSI) rewrites retrieved documents into pathological tones. RSP-M is a lightweight runtime monitor that calculates RSV metrics in real time and triggers alerts when values exceed safety thresholds.
- Score: 4.699272847316498
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Model (LLM) agents relying on external retrieval are increasingly deployed in high-stakes environments. While existing adversarial attacks primarily focus on content falsification or instruction injection, we identify a novel, process-oriented attack surface: the agent's reasoning style. We propose Reasoning-Style Poisoning (RSP), a paradigm that manipulates how agents process information rather than what they process. We introduce Generative Style Injection (GSI), an attack method that rewrites retrieved documents into pathological tones--specifically "analysis paralysis" or "cognitive haste"--without altering underlying facts or using explicit triggers. To quantify these shifts, we develop the Reasoning Style Vector (RSV), a metric tracking Verification depth, Self-confidence, and Attention focus. Experiments on HotpotQA and FEVER using ReAct, Reflection, and Tree of Thoughts (ToT) architectures reveal that GSI significantly degrades performance. It increases reasoning steps by up to 4.4 times or induces premature errors, successfully bypassing state-of-the-art content filters. Finally, we propose RSP-M, a lightweight runtime monitor that calculates RSV metrics in real-time and triggers alerts when values exceed safety thresholds. Our work demonstrates that reasoning style is a distinct, exploitable vulnerability, necessitating process-level defenses beyond static content analysis.
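The abstract gives no implementation details for RSV or RSP-M, but the threshold-triggered monitoring it describes is straightforward to sketch. The Python fragment below is a minimal, hypothetical illustration: the `RSV` dataclass, the keyword-based heuristics for verification depth, self-confidence, and attention focus, and the threshold values are all assumptions made for illustration, not the paper's actual metrics or code.

```python
# Hypothetical sketch of an RSV-style runtime monitor (not the paper's RSP-M implementation).
# Assumed heuristics: verification depth ~ count of re-checking actions,
# self-confidence ~ inverse of hedging-phrase frequency,
# attention focus ~ fraction of the step's tokens that mention task entities.
from dataclasses import dataclass

HEDGES = ("maybe", "not sure", "need to verify", "let me double-check")
VERIFY_ACTIONS = ("verify", "re-check", "cross-reference")

@dataclass
class RSV:
    verification_depth: float   # how much re-checking the step performs
    self_confidence: float      # 1.0 = fully assertive, 0.0 = maximally hedged
    attention_focus: float      # overlap between the step and the task entities

def score_step(step_text: str, task_entities: set[str]) -> RSV:
    """Compute a crude RSV estimate for a single reasoning step."""
    lowered = step_text.lower()
    words = lowered.split()
    verification = sum(lowered.count(a) for a in VERIFY_ACTIONS)
    hedging = sum(lowered.count(h) for h in HEDGES)
    focus_hits = sum(1 for w in words if w.strip(".,") in task_entities)
    return RSV(
        verification_depth=float(verification),
        self_confidence=max(0.0, 1.0 - 0.25 * hedging),
        attention_focus=focus_hits / max(len(words), 1),
    )

def monitor(steps: list[str], task_entities: set[str],
            max_verification: float = 3.0,
            min_confidence: float = 0.4,
            min_focus: float = 0.05) -> list[tuple[int, RSV]]:
    """Flag steps whose RSV values leave the assumed safe region,
    i.e. possible 'analysis paralysis' or 'cognitive haste'."""
    alerts = []
    for i, step in enumerate(steps):
        rsv = score_step(step, task_entities)
        if (rsv.verification_depth > max_verification
                or rsv.self_confidence < min_confidence
                or rsv.attention_focus < min_focus):
            alerts.append((i, rsv))
    return alerts

if __name__ == "__main__":
    trace = [
        "Thought: the question asks about the director of the 1997 film.",
        "Thought: not sure, need to verify, let me double-check and re-check every source again.",
    ]
    print(monitor(trace, task_entities={"director", "film", "1997"}))
```

In practice, the paper presumably derives the RSV components from the agent's full reasoning trace (e.g., ReAct thought/action steps) rather than from simple keyword counts, and the alert thresholds would be calibrated on benign traces; the sketch only shows the general shape of per-step scoring plus threshold-based alerting.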
Related papers
- Beyond Input Guardrails: Reconstructing Cross-Agent Semantic Flows for Execution-Aware Attack Detection [32.301679396929536]
We propose SysName, a framework that shifts the defensive paradigm from static input filtering to execution-aware analysis. SysName synthesizes fragmented operational primitives into contiguous behavioral trajectories, enabling a holistic view of system activity. Empirical evaluations demonstrate that SysName effectively detects over ten distinct compound attack vectors, achieving F1-scores of 85.3% and 66.7% for node-level and path-level end-to-end attack detection, respectively.
arXiv Detail & Related papers (2026-03-04T01:59:16Z) - AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification [25.817251923574286]
We propose a novel inference-time detection and mitigation framework for large language model (LLM) agents. AgentSentry is the first inference-time defense to model multi-turn IPI as a temporal causal takeover. We evaluate AgentSentry on the AgentDojo benchmark across four task suites, three IPI attack families, and multiple black-box LLMs.
arXiv Detail & Related papers (2026-02-26T07:59:10Z) - Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections [57.64370755825839]
Self-evolving agents update their internal state across sessions, often by writing and reusing long-term memory. We study this risk and formalize a persistent attack we call a Zombie Agent. We present a black-box attack framework that uses only indirect exposure through attacker-controlled web content.
arXiv Detail & Related papers (2026-02-17T15:28:24Z) - Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z) - ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z) - Defenses Against Prompt Attacks Learn Surface Heuristics [40.392588465939106]
Large language models (LLMs) are increasingly deployed in security-sensitive applications. LLMs may override intended logic when adversarial instructions appear in user queries or retrieved content. Recent defenses rely on supervised fine-tuning with benign and malicious labels.
arXiv Detail & Related papers (2026-01-12T04:12:48Z) - DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion [0.7351161122478707]
Deep neural networks are vulnerable to Trojan (backdoor) attacks. Trigger inversion reconstructs malicious "shortcut" patterns inserted by an adversary during training. We propose a data-free, zero-shot trigger-inversion strategy that restricts the search space while avoiding strong assumptions on trigger appearance.
arXiv Detail & Related papers (2025-07-30T16:31:13Z) - CyberRAG: An Agentic RAG cyber attack classification and reporting tool [0.3914676152740142]
CyberRAG is a modular agent-based RAG framework that delivers real-time classification, explanation, and structured reporting for cyber-attacks. Unlike traditional RAG, CyberRAG adopts an agentic design that enables dynamic control flow and adaptive reasoning.
arXiv Detail & Related papers (2025-07-03T08:32:19Z) - Poison Once, Control Anywhere: Clean-Text Visual Backdoors in VLM-based Mobile Agents [54.35629963816521]
This work introduces VIBMA, the first clean-text backdoor attack targeting VLM-based mobile agents. The attack injects malicious behaviors into the model by modifying only the visual input. We show that our attack achieves high success rates while preserving clean-task behavior.
arXiv Detail & Related papers (2025-06-16T08:09:32Z) - Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data. Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix. We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z) - LLM-Based User Simulation for Low-Knowledge Shilling Attacks on Recommender Systems [28.559223475725137]
We introduce Agent4SR, a novel framework that leverages Large Language Model (LLM)-based agents to perform low-knowledge, high-impact shilling attacks. Agent4SR simulates realistic user behavior by orchestrating adversarial interactions, selecting items, assigning ratings, and crafting reviews, while maintaining behavioral plausibility. Our findings reveal a new class of emergent threats posed by LLM-driven agents, underscoring the urgent need for enhanced defenses in recommender systems.
arXiv Detail & Related papers (2025-05-18T04:40:34Z) - POISONCRAFT: Practical Poisoning of Retrieval-Augmented Generation for Large Language Models [4.620537391830117]
Large language models (LLMs) are susceptible to hallucinations, which can lead to incorrect or misleading outputs. Retrieval-augmented generation (RAG) is a promising approach to mitigate hallucinations by leveraging external knowledge sources. In this paper, we study a poisoning attack on RAG systems named POISONCRAFT, which can mislead the model to refer to fraudulent websites.
arXiv Detail & Related papers (2025-05-10T09:36:28Z) - AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases [73.04652687616286]
We propose AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base.
Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning.
On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance.
arXiv Detail & Related papers (2024-07-17T17:59:47Z) - Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena. We find that we can successfully break latest agents that use black-box frontier LMs, including those that perform reflection and tree search. We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site makes no guarantee about the quality of the listed information and is not responsible for any consequences arising from its use.