LM Agents May Fail to Act on Their Own Risk Knowledge
- URL: http://arxiv.org/abs/2508.13465v1
- Date: Tue, 19 Aug 2025 02:46:08 GMT
- Title: LM Agents May Fail to Act on Their Own Risk Knowledge
- Authors: Yuzhi Tang, Tianxiao Li, Elizabeth Li, Chris J. Maddison, Honghua Dong, Yangjun Ruan,
- Abstract summary: Language model (LM) agents pose a diverse array of potential, severe risks in safety-critical scenarios.<n>While they often answer "Yes" to queries like "Is executing sudo rm -rf /*' dangerous?", they will likely fail to identify such risks in instantiated trajectories.
- Score: 15.60032437959883
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language model (LM) agents have demonstrated significant potential for automating real-world tasks, yet they pose a diverse array of potential, severe risks in safety-critical scenarios. In this work, we identify a significant gap between LM agents' risk awareness and safety execution abilities: while they often answer "Yes" to queries like "Is executing `sudo rm -rf /*' dangerous?", they will likely fail to identify such risks in instantiated trajectories or even directly perform these risky actions when acting as agents. To systematically investigate this, we develop a comprehensive evaluation framework to examine agents' safety across three progressive dimensions: 1) their knowledge about potential risks, 2) their ability to identify corresponding risks in execution trajectories, and 3) their actual behaviors to avoid executing these risky actions. Our evaluation reveals two critical performance gaps that resemble the generator-validator gaps observed in LMs: while agents demonstrate near-perfect risk knowledge ($>98\%$ pass rates), they fail to apply this knowledge when identifying risks in actual scenarios (with performance dropping by $>23\%$) and often still execute risky actions ($<26\%$ pass rates). Notably, this trend persists across more capable LMs as well as in specialized reasoning models like DeepSeek-R1, indicating that simply scaling model capabilities or inference compute does not inherently resolve safety concerns. Instead, we take advantage of these observed gaps to develop a risk verifier that independently critiques the proposed actions by agents, with an abstractor that converts specific execution trajectories into abstract descriptions where LMs can more effectively identify the risks. Our overall system achieves a significant reduction of risky action execution by $55.3\%$ over vanilla-prompted agents.
Related papers
- The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment [148.80266237240713]
implicit training-time safety risks are driven by a model's internal incentives and contextual background information.<n>We present the first systematic study of this problem, introducing a taxonomy with five risk levels, ten fine-grained risk categories, and three incentive types.<n>Our results identify an overlooked yet urgent safety challenge in training.
arXiv Detail & Related papers (2026-02-04T04:23:58Z) - When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails [74.63933201261595]
Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex reasoning tasks.<n>LRMs remain vulnerable to severe safety risks, including harmful content generation and jailbreak attacks.<n>We propose the Chain-of-Guardrail (CoG), a training framework that recomposes or backtracks unsafe reasoning steps.
arXiv Detail & Related papers (2025-10-24T09:32:25Z) - Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents [30.378925170216835]
Self-replication risk of Large Language Model (LLM) agents driven by objective misalignment has drawn growing attention.<n>We present a comprehensive evaluation framework for quantifying self-replication risks.
arXiv Detail & Related papers (2025-09-29T17:49:50Z) - SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents [58.21223208538351]
This work explores the security issues surrounding mobile multimodal agents.<n>It attempts to construct a risk discrimination mechanism by incorporating behavioral sequence information.<n>It also designs an automated assisted assessment scheme based on a large language model.
arXiv Detail & Related papers (2025-07-01T15:10:00Z) - IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks [30.535665641990114]
We present IS-Bench, the first multi-modal benchmark designed for interactive safety.<n>It features 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator.<n>It facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps.
arXiv Detail & Related papers (2025-06-19T15:34:46Z) - AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions [64.85086226439954]
We present SAFE, a benchmark for assessing the safety of embodied VLM agents on hazardous instructions.<n> SAFE comprises three components: SAFE-THOR, SAFE-VERSE, and SAFE-DIAGNOSE.<n>We uncover systematic failures in translating hazard recognition into safe planning and execution.
arXiv Detail & Related papers (2025-06-17T16:37:35Z) - Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models [29.569220030102986]
We introduce textbfBeyond Safe Answers (BSA) bench, a novel benchmark comprising 2,000 challenging instances organized into three distinct SSA scenario types.<n> Evaluations of 19 state-of-the-art LRMs demonstrate the difficulty of this benchmark, with top-performing models achieving only 38.0% accuracy in correctly identifying risk rationales.<n>Our work provides a comprehensive assessment tool for evaluating and improving safety reasoning fidelity in LRMs, advancing the development of genuinely risk-aware and reliably safe AI systems.
arXiv Detail & Related papers (2025-05-26T08:49:19Z) - Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models [63.559461750135334]
Language models (LMs) are increasingly used to build agents that can act autonomously to achieve goals.<n>We study this "answer-or-defer" problem with an evaluation framework that systematically varies human-specified risk structures.<n>We find that a simple skill-decomposition method, which isolates the independent skills required for answer-or-defer decision making, can consistently improve LMs' decision policies.
arXiv Detail & Related papers (2025-03-03T09:16:26Z) - AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection [47.83354878065321]
We propose AGrail, a lifelong guardrail to enhance agent safety.<n>AGrail features adaptive safety check generation, effective safety check optimization, and tool compatibility and flexibility.
arXiv Detail & Related papers (2025-02-17T05:12:33Z) - Nuclear Deployed: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents [10.565508277042564]
Large language models (LLMs) are evolving into autonomous decision-makers, raising concerns about catastrophic risks in high-stakes scenarios.<n>Based on the insight that such risks can originate from trade-offs between the agent's Helpful, Harmlessness and Honest (HHH) goals, we build a novel three-stage evaluation framework.<n>We conduct 14,400 agentic simulations across 12 advanced LLMs, with extensive experiments and analysis.
arXiv Detail & Related papers (2025-02-17T02:11:17Z) - Risks of AI Scientists: Prioritizing Safeguarding Over Autonomy [65.77763092833348]
This perspective examines vulnerabilities in AI scientists, shedding light on potential risks associated with their misuse.<n>We take into account user intent, the specific scientific domain, and their potential impact on the external environment.<n>We propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback.
arXiv Detail & Related papers (2024-02-06T18:54:07Z) - The Reasoning Under Uncertainty Trap: A Structural AI Risk [0.0]
Report provides an exposition of what makes RUU so challenging for both humans and machines.
We detail how this misuse risk connects to a wider network of underlying structural risks.
arXiv Detail & Related papers (2024-01-29T17:16:57Z) - Safety Margins for Reinforcement Learning [53.10194953873209]
We show how to leverage proxy criticality metrics to generate safety margins.
We evaluate our approach on learned policies from APE-X and A3C within an Atari environment.
arXiv Detail & Related papers (2023-07-25T16:49:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.