Fugu-MT Paper Translation (Abstract): Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration

Paper summary: Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration

arxiv url: http://arxiv.org/abs/2605.01970v2
Date: Tue, 05 May 2026 11:52:20 GMT
Status: translation complete
Last updated in system: 2026-05-06 14:45:21.253221
Title: Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
Title (reference translation): Trojan Hippo: Weaponizing Memory for Data Exfiltration
Authors: Debeshee Das, Julien Piet, Darya Kaviani, Luca Beurer-Kellner, Florian Tramèr, David Wagner
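Abstract summary: Trojan Hippo is a class of persistent memory attacks that operates in a more realistic threat model than prior memory poisoning work. The authors evaluate four memory-system defenses inspired by basic security principles and find that they substantially reduce attack success rates, though at a utility cost. Because of this substantial security-utility tradeoff, the effective real-world deployment of defenses remains an open challenge.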
Reference score (independently computed attention score): 33.8989871605613
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Memory systems enable otherwise-stateless LLM agents to persist user information across sessions, but also introduce a new attack surface. We characterize the Trojan Hippo attack, a class of persistent memory attacks that operates in a more realistic threat model than prior memory poisoning work: the attacker plants a dormant payload into an agent's long-term memory via a single untrusted tool call (e.g., a crafted email), which activates only when the user later discusses sensitive topics such as finance, health, or identity, and exfiltrates high-value personal data to the attacker. While anecdotal demonstrations of such attacks have appeared against deployed systems, no prior work systematically evaluates them across heterogeneous memory architectures and defenses. We introduce a dynamic evaluation framework comprising two components: (1) an OpenEvolve-based adaptive red-teaming benchmark that stress-tests defenses and memory backends against continuously refined attacks, and (2) the first capability-aware security/utility analysis for persistent memory systems, enabling principled reasoning about defense deployment across different usage profiles. Instantiated on an email assistant across four memory backends (explicit tool memory, agentic memory, RAG, and sliding-window context), Trojan Hippo achieves up to 85-100% ASR against current frontier models from OpenAI and Google, with planted memories successfully activating even after 100 benign sessions. We evaluate four memory-system defenses inspired by basic security principles, finding they substantially reduce attack success rates (to as low as 0-5%), though at utility costs that vary widely with task requirements. Because of this substantial security-utility tradeoff, the effective real-world deployment of defenses remains an open challenge, which our evaluation framework is specifically designed to address.
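Abstract (reference translation): Memory systems enable otherwise-stateless LLM agents to persist user information across sessions, but also introduce a new attack surface. The attacker plants a dormant payload into an agent's long-term memory via a single untrusted tool call (e.g., a crafted email); the payload activates only when the user later discusses sensitive topics such as finance, health, or identity, and exfiltrates high-value personal data to the attacker. While anecdotal demonstrations of such attacks have appeared against deployed systems, no prior work systematically evaluates them across heterogeneous memory architectures and defenses. We introduce a dynamic evaluation framework comprising two components: (1) an OpenEvolve-based adaptive red-teaming benchmark that stress-tests defenses and memory backends against continuously refined attacks, and (2) the first capability-aware security/utility analysis for persistent memory systems, enabling principled reasoning about defense deployment. Instantiated on an email assistant across four memory backends (explicit tool memory, agentic memory, RAG, and sliding-window context), Trojan Hippo achieves up to 85-100% ASR against current frontier models from OpenAI and Google. We evaluate four memory-system defenses inspired by basic security principles, finding they substantially reduce attack success rates (to as low as 0-5%), though at utility costs that vary widely with task requirements. Because of this substantial security-utility tradeoff, the effective real-world deployment of defenses remains an open challenge, which our evaluation framework is specifically designed to address.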

Related papers