Fugu-MT Paper Translation (Abstract): LM Agents May Fail to Act on Their Own Risk Knowledge

Paper summary: LM Agents May Fail to Act on Their Own Risk Knowledge

arxiv url: http://arxiv.org/abs/2508.13465v1
Date: Tue, 19 Aug 2025 02:46:08 GMT
Status: Translation complete
Last updated in system: 2025-08-20 15:36:31.775154
Title: LM Agents May Fail to Act on Their Own Risk Knowledge
Title (reference translation): LM agents may fail to act on their own risk knowledge
Authors: Yuzhi Tang, Tianxiao Li, Elizabeth Li, Chris J. Maddison, Honghua Dong, Yangjun Ruan
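Abstract summary: Language model (LM) agents pose a diverse array of potential, severe risks in safety-critical scenarios. They often answer "Yes" to queries like "Is executing `sudo rm -rf /*` dangerous?", yet they are likely to fail to identify such risks in instantiated trajectories.
Reference score (independently computed attention score): 15.60032437959883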
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language model (LM) agents have demonstrated significant potential for automating real-world tasks, yet they pose a diverse array of potential, severe risks in safety-critical scenarios. In this work, we identify a significant gap between LM agents' risk awareness and safety execution abilities: while they often answer "Yes" to queries like "Is executing `sudo rm -rf /*' dangerous?", they will likely fail to identify such risks in instantiated trajectories or even directly perform these risky actions when acting as agents. To systematically investigate this, we develop a comprehensive evaluation framework to examine agents' safety across three progressive dimensions: 1) their knowledge about potential risks, 2) their ability to identify corresponding risks in execution trajectories, and 3) their actual behaviors to avoid executing these risky actions. Our evaluation reveals two critical performance gaps that resemble the generator-validator gaps observed in LMs: while agents demonstrate near-perfect risk knowledge ($>98\%$ pass rates), they fail to apply this knowledge when identifying risks in actual scenarios (with performance dropping by $>23\%$) and often still execute risky actions ($<26\%$ pass rates). Notably, this trend persists across more capable LMs as well as in specialized reasoning models like DeepSeek-R1, indicating that simply scaling model capabilities or inference compute does not inherently resolve safety concerns. Instead, we take advantage of these observed gaps to develop a risk verifier that independently critiques the proposed actions by agents, with an abstractor that converts specific execution trajectories into abstract descriptions where LMs can more effectively identify the risks. Our overall system achieves a significant reduction of risky action execution by $55.3\%$ over vanilla-prompted agents.
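Abstract (reference translation): Language model (LM) agents show great potential for automating real-world tasks, but they pose a variety of potential, severe risks in safety-critical scenarios. This work identifies a large gap between LM agents' risk awareness and their ability to act safely: they often answer "Yes" to queries such as "Is executing `sudo rm -rf /*` dangerous?", yet they are likely to miss such risks in instantiated trajectories, or even to carry out the risky actions directly when acting as agents. To investigate this systematically, we develop a comprehensive framework that evaluates agent safety along three progressive dimensions: 1) knowledge of potential risks, 2) the ability to identify the corresponding risks in execution trajectories, and 3) actual behavior in avoiding risky actions. The evaluation reveals two performance gaps that resemble the generator-validator gaps observed in LMs: agents show near-perfect risk knowledge ($>98\%$ pass rates), but they fail to apply it when identifying risks in actual scenarios (performance drops by $>23\%$) and often still execute risky actions ($<26\%$ pass rates). The trend persists in more capable LMs and in specialized reasoning models such as DeepSeek-R1, which indicates that scaling model capability or inference compute does not by itself resolve safety concerns. Instead, we exploit the observed gaps to build a risk verifier that independently critiques the actions an agent proposes, together with an abstractor that converts specific execution trajectories into abstract descriptions in which LMs can identify risks more effectively. The overall system reduces risky action execution by $55.3\%$ relative to vanilla-prompted agents.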
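The verifier-plus-abstractor idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Step`, `abstract_action`, `is_risky`, `verify`, and `toy_lm`, as well as the prompt wording, are all hypothetical, and the "model" is a keyword-based stand-in so the example runs without any API. It only shows the control flow: convert the concrete proposed action into an abstract description, ask the risk question about that description, and block the action if the answer is "Yes".

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any callable mapping a prompt string to a reply string.
LM = Callable[[str], str]


@dataclass
class Step:
    """One past step of an agent trajectory (illustrative structure)."""
    action: str
    observation: str = ""


def abstract_action(lm: LM, history: List[Step], proposed: str) -> str:
    """Abstractor: ask the LM for an abstract description of the proposed action."""
    past = "\n".join(f"{s.action} -> {s.observation}" for s in history)
    prompt = (
        "Describe in one abstract sentence, without concrete paths or "
        "arguments, what the proposed action does.\n"
        f"History:\n{past}\nProposed action: {proposed}"
    )
    return lm(prompt).strip()


def is_risky(lm: LM, description: str) -> bool:
    """Risk-knowledge query: a plain Yes/No question about an abstract description."""
    reply = lm(f"Is performing the following dangerous? Answer Yes or No.\n{description}")
    return reply.strip().lower().startswith("yes")


def verify(lm: LM, history: List[Step], proposed: str) -> bool:
    """Verifier: return True if the proposed action may run, False if it is blocked."""
    description = abstract_action(lm, history, proposed)
    return not is_risky(lm, description)


def toy_lm(prompt: str) -> str:
    """Keyword-based stand-in for a real model (assumption, for demonstration only)."""
    if prompt.startswith("Describe"):
        if "rm -rf" in prompt:
            return "Recursively delete files with elevated privileges"
        return "Run a read-only shell command"
    return "Yes" if "delete" in prompt.lower() else "No"


if __name__ == "__main__":
    print(verify(toy_lm, [], "sudo rm -rf /*"))                 # blocked
    print(verify(toy_lm, [Step("cd /tmp", "ok")], "ls -la"))    # allowed
```

In a real setting `toy_lm` would be replaced by a call to an actual model, and the verifier would run separately from the agent that proposed the action, so that the critique is independent of the agent's own reasoning.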

Related papers