Memory Poisoning Attack and Defense on Memory Based LLM-Agents
- URL: http://arxiv.org/abs/2601.05504v2
- Date: Mon, 12 Jan 2026 03:35:39 GMT
- Title: Memory Poisoning Attack and Defense on Memory Based LLM-Agents
- Authors: Balachandra Devarangadi Sunil, Isheeta Sinha, Piyush Maheshwari, Shantanu Todmal, Shreyan Mallik, Shuchi Mishra,
- Abstract summary: Large language model agents equipped with persistent memory are vulnerable to memory poisoning attacks. Recent work demonstrated that the MINJA (Memory Injection Attack) achieves over 95 % injection success rate. This work addresses gaps through systematic empirical evaluation of memory poisoning attacks and defenses.
- Score: 3.7127635602605014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language model agents equipped with persistent memory are vulnerable to memory poisoning attacks, where adversaries inject malicious instructions through query-only interactions that corrupt the agent's long-term memory and influence future responses. Recent work demonstrated that the MINJA (Memory Injection Attack) achieves over 95 % injection success rate and 70 % attack success rate under idealized conditions. However, the robustness of these attacks in realistic deployments and effective defensive mechanisms remain understudied. This work addresses these gaps through systematic empirical evaluation of memory poisoning attacks and defenses in Electronic Health Record (EHR) agents. We investigate attack robustness by varying three critical dimensions: initial memory state, number of indication prompts, and retrieval parameters. Our experiments on GPT-4o-mini, Gemini-2.0-Flash and Llama-3.1-8B-Instruct models using MIMIC-III clinical data reveal that realistic conditions with pre-existing legitimate memories dramatically reduce attack effectiveness. We then propose and evaluate two novel defense mechanisms: (1) Input/Output Moderation using composite trust scoring across multiple orthogonal signals, and (2) Memory Sanitization with trust-aware retrieval employing temporal decay and pattern-based filtering. Our defense evaluation reveals that effective memory sanitization requires careful trust threshold calibration to prevent both overly conservative rejection (blocking all entries) and insufficient filtering (missing subtle attacks), establishing important baselines for future adaptive defense mechanisms. These findings provide crucial insights for securing memory-augmented LLM agents in production environments.
Related papers
- Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections [57.64370755825839]
  Self-evolving agents update their internal state across sessions, often by writing and reusing long-term memory. We study this risk and formalize a persistent attack we call a Zombie Agent. We present a black-box attack framework that uses only indirect exposure through attacker-controlled web content.
  arXiv Detail & Related papers (2026-02-17T15:28:24Z)
- A-MemGuard: A Proactive Defense Framework for LLM-Based Agent Memory [31.673865459672285]
  Large Language Model (LLM) agents use memory to learn from past interactions. An adversary can inject seemingly harmless records into an agent's memory to manipulate its future behavior. A-MemGuard is the first proactive defense framework for LLM agent memory.
  arXiv Detail & Related papers (2025-09-29T16:04:15Z)
- Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in CLIP [51.04452017089568]
  Class-wise Backdoor Prompt Tuning (CBPT) is an efficient and effective defense mechanism that operates on text prompts to indirectly purify CLIP. CBPT significantly mitigates backdoor threats while preserving model utility.
  arXiv Detail & Related papers (2025-02-26T16:25:15Z)
- Proactive Privacy Amnesia for Large Language Models: Safeguarding PII with Negligible Impact on Model Utility [39.51362903320998]
  We propose a novel approach, Proactive Privacy Amnesia, to safeguard PII in large language models (LLMs). This mechanism works by actively identifying and forgetting key memories most closely associated with PII in sequences, followed by a memory implanting to maintain the LLM's functionality. Results show that our PPA method completely eliminates the risk of phone number exposure by 100% and significantly reduces the risk of physical address exposure by 9.8% - 87.6%.
  arXiv Detail & Related papers (2025-02-24T19:16:39Z)
- Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMs [3.7913442178940318]
  Modern large language models (LLMs) exhibit critical vulnerabilities to poison pill attacks. We demonstrate these attacks exploit inherent architectural properties of LLMs. Our work establishes poison pills as both a security threat and diagnostic tool.
  arXiv Detail & Related papers (2025-02-23T06:34:55Z)
- Game-Theoretic Defenses for Robust Conformal Prediction Against Adversarial Attacks in Medical Imaging [12.644923600594176]
  Adversarial attacks pose significant threats to the reliability and safety of deep learning models. This paper introduces a novel framework that integrates conformal prediction with game-theoretic defensive strategies.
  arXiv Detail & Related papers (2024-11-07T02:20:04Z)
- AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases [73.04652687616286]
  We propose AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning. On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance.
  arXiv Detail & Related papers (2024-07-17T17:59:47Z)
- Malicious Agent Detection for Robust Multi-Agent Collaborative Perception [52.261231738242266]
  Multi-agent collaborative (MAC) perception is more vulnerable to adversarial attacks than single-agent perception. We propose Malicious Agent Detection (MADE), a reactive defense specific to MAC perception. We conduct comprehensive evaluations on a benchmark 3D dataset V2X-sim and a real-road dataset DAIR-V2X.
  arXiv Detail & Related papers (2023-10-18T11:36:42Z)
- RECESS Vaccine for Federated Learning: Proactive Defense Against Model Poisoning Attacks [20.55681622921858]
  Model poisoning attacks greatly jeopardize the application of federated learning (FL). In this work, we propose a novel proactive defense named RECESS against model poisoning attacks. Unlike previous methods that score each iteration, RECESS considers clients' performance correlation across multiple iterations to estimate the trust score.
  arXiv Detail & Related papers (2023-10-09T06:09:01Z)
- On Practical Aspects of Aggregation Defenses against Data Poisoning Attacks [58.718697580177356]
  Attacks on deep learning models with malicious training samples are known as data poisoning. Recent advances in defense strategies against data poisoning have highlighted the effectiveness of aggregation schemes in achieving certified poisoning robustness. Here we focus on Deep Partition Aggregation, a representative aggregation defense, and assess its practical aspects, including efficiency, performance, and robustness.
  arXiv Detail & Related papers (2023-06-28T17:59:35Z)
- Avoid Adversarial Adaption in Federated Learning by Multi-Metric Investigations [55.2480439325792]
  Federated Learning (FL) facilitates decentralized machine learning model training, preserving data privacy, lowering communication costs, and boosting model performance through diversified data sources. FL faces vulnerabilities such as poisoning attacks, undermining model integrity with both untargeted performance degradation and targeted backdoor attacks. We define a new notion of strong adaptive adversaries, capable of adapting to multiple objectives simultaneously. MESAS is the first defense robust against strong adaptive adversaries, effective in real-world data scenarios, with an average overhead of just 24.37 seconds.
  arXiv Detail & Related papers (2023-06-06T11:44:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.