Fugu-MT paper translation (abstract): Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

Paper summary: Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

arxiv url: http://arxiv.org/abs/2605.29224v1
Date: Thu, 28 May 2026 01:23:48 GMT
Status: translation complete
System update date: 2026-05-30 02:45:55.578907
Title: Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents
Title (reference translation): Relevance as a Vulnerability: How Web Retrieval Lowers the Safety Alignment of LLM Agents
Authors: Aditya Nawal, Manit Baser, Mohan Gurusamy
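Abstract summary: This paper introduces AgentREVEAL, a framework for diagnosing retrieval-induced safety degradation in LLM agents. It shows that even oppositional or safety-oriented sources, such as pages containing warnings or risk disclaimers, can increase harmful compliance by an average of 25% compared to the no-retrieval baseline. Because relevance is also what makes retrieval useful, these results expose a safety-utility trade-off for retrieval-enabled agents.
Reference score (independently computed attention metric): 2.905751301655124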
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment mechanisms that govern model outputs. Prior work shows that enabling retrieval in agents increases compliance with harmful requests. We introduce AgentREVEAL, a diagnostic framework for analyzing retrieval-induced safety degradation in LLM agents. The framework examines two axes: how retrieval is integrated into the agent pipeline and the properties of the retrieved content. Along the integration axis, we find that binding tool invocation and response generation in a single step amplifies harmful outputs. Along the content axis, we uncover the Safe Source Paradox: even oppositional or safety-oriented sources, such as pages containing warnings or risk disclaimers, can increase harmful compliance by an average of 25% compared to the no-retrieval baseline. Finally, we show that relevance acts as a shared activation condition for both vulnerabilities. Similar patterns appear on frontier closed models, and harmful compliance remains elevated under several representative pipeline interventions, with some agents also entering this regime under autonomous retrieval. Because relevance is also what makes retrieval useful, these results expose a safety-utility trade-off for retrieval-enabled agents. We introduce HarmURLBench, a benchmark containing 1,405 real-world URLs paired with 320 harmful behaviors to support future evaluations.
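Abstract (reference translation): AI agents extend large language models with external tools such as web retrieval, which enables grounded, up-to-date responses. Integrating external content into the generation pipeline, however, can weaken the safety alignment mechanisms that govern model outputs. Earlier work shows that enabling retrieval in agents raises compliance with harmful requests. This paper introduces AgentREVEAL, a framework for diagnosing retrieval-induced safety degradation in LLM agents. The framework examines two axes: how retrieval is integrated into the agent pipeline, and the properties of the retrieved content. Along the integration axis, binding tool invocation and response generation into a single step is found to amplify harmful outputs. Along the content axis, the paper identifies the Safe Source Paradox: even oppositional or safety-oriented sources, such as pages containing warnings or risk disclaimers, can increase harmful compliance by an average of 25% relative to the no-retrieval baseline. Finally, relevance is shown to act as a shared activation condition for both vulnerabilities. Similar patterns appear on frontier closed models, harmful compliance stays elevated under several representative pipeline interventions, and some agents also enter this regime under autonomous retrieval. Because relevance is also what makes retrieval useful, these results expose a safety-utility trade-off for retrieval-enabled agents. HarmURLBench, a benchmark pairing 1,405 real-world URLs with 320 harmful behaviors, is introduced to support future evaluations.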

Related papers