Related papers: SafeMind: Benchmarking and Mitigating Safety Risks in Embodied LLM Agents

URL: http://arxiv.org/abs/2509.25885v1
Date: Tue, 30 Sep 2025 07:24:04 GMT
Title: SafeMind: Benchmarking and Mitigating Safety Risks in Embodied LLM Agents
Authors: Ruolin Chen, Yinqian Sun, Jihang Wang, Mingyang Lv, Qian Zhang, Yi Zeng
Abstract summary: Embodied agents powered by large language models (LLMs) inherit advanced planning capabilities; however, their direct interaction with the physical world exposes them to safety vulnerabilities. We present SafeMindBench, a multimodal benchmark with 5,558 samples spanning four task categories (Instr-Risk, Env-Risk, Order-Fix, Req-Align) across high-risk scenarios such as sabotage, harm, privacy, and illegal behavior. We introduce SafeMindAgent, a modular Planner-Executor architecture integrated with three cascaded safety modules, which incorporate safety constraints into the reasoning process.
Score: 7.975014390527644
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Embodied agents powered by large language models (LLMs) inherit advanced planning capabilities; however, their direct interaction with the physical world exposes them to safety vulnerabilities. In this work, we identify four key reasoning stages where hazards may arise: Task Understanding, Environment Perception, High-Level Plan Generation, and Low-Level Action Generation. We further formalize three orthogonal safety constraint types (Factual, Causal, and Temporal) to systematically characterize potential safety violations. Building on this risk model, we present SafeMindBench, a multimodal benchmark with 5,558 samples spanning four task categories (Instr-Risk, Env-Risk, Order-Fix, Req-Align) across high-risk scenarios such as sabotage, harm, privacy, and illegal behavior. Extensive experiments on SafeMindBench reveal that leading LLMs (e.g., GPT-4o) and widely used embodied agents remain susceptible to safety-critical failures. To address this challenge, we introduce SafeMindAgent, a modular Planner-Executor architecture integrated with three cascaded safety modules, which incorporate safety constraints into the reasoning process. Results show that SafeMindAgent significantly improves safety rate over strong baselines while maintaining comparable task completion. Together, SafeMindBench and SafeMindAgent provide both a rigorous evaluation suite and a practical solution that advance the systematic study and mitigation of safety risks in embodied LLM agents.

Related papers

Automated Safety Benchmarking: A Multi-agent Pipeline for LVLMs [61.01470415470677]
Large vision-language models (LVLMs) exhibit remarkable capabilities in cross-modal tasks but face significant safety challenges. Existing benchmarks are hindered by their labor-intensive construction process, static complexity, and limited discriminative power. We propose VLSafetyBencher, the first automated system for LVLM safety benchmarking.
arXiv Detail & Related papers (2026-01-27T11:51:30Z)
How Brittle is Agent Safety? Rethinking Agent Risk under Intent Concealment and Task Complexity [55.441602598245744]
Current safety evaluations for LLM-driven agents primarily focus on atomic harms, failing to address sophisticated threats where malicious intent is concealed or diluted within complex tasks. We address this gap with a two-dimensional analysis of agent safety brittleness under the pressures of intent concealment and task complexity. Our findings reveal two critical phenomena: safety alignment degrades sharply and predictably as intent becomes obscured, and a "Complexity Paradox" emerges, where agents seem safer on harder tasks only due to capability limitations.
arXiv Detail & Related papers (2025-11-11T17:27:27Z)
DeepKnown-Guard: A Proprietary Model-Based Safety Response Framework for AI Agents [12.054307827384415]
The security issues of Large Language Models (LLMs) have become increasingly prominent, severely constraining their trustworthy deployment in critical domains. This paper proposes a novel safety response framework designed to safeguard LLMs at both the input and output levels.
arXiv Detail & Related papers (2025-11-05T03:04:35Z)
IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks [30.535665641990114]
We present IS-Bench, the first multi-modal benchmark designed for interactive safety. It features 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. It facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps.
arXiv Detail & Related papers (2025-06-19T15:34:46Z)
AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions [76.74726258534142]
We propose AGENTSAFE, the first benchmark for evaluating the safety of embodied VLM agents under hazardous instructions. AGENTSAFE simulates realistic agent-environment interactions within a simulation sandbox. The benchmark includes 45 adversarial scenarios, 1,350 hazardous tasks, and 8,100 hazardous instructions.
arXiv Detail & Related papers (2025-06-17T16:37:35Z)
Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model [29.63418384788804]
We conduct a safety evaluation of 11 Multimodal Large Reasoning Models (MLRMs) across 5 benchmarks. Our analysis reveals distinct safety patterns across different benchmarks. Leveraging the model's intrinsic reasoning capabilities to detect unsafe intent is a potential approach to addressing safety issues in MLRMs.
arXiv Detail & Related papers (2025-05-10T06:59:36Z)
A Framework for Benchmarking and Aligning Task-Planning Safety in LLM-Based Embodied Agents [13.225168384790257]
Large Language Models (LLMs) exhibit substantial promise in enhancing task-planning capabilities within embodied agents. We present Safe-BeAl, an integrated framework for the measurement (SafePlan-Bench) and alignment (Safe-Align) of LLM-based embodied agents' behaviors. Our empirical analysis reveals that even in the absence of adversarial inputs or malicious intent, LLM-based agents can exhibit unsafe behaviors.
arXiv Detail & Related papers (2025-04-20T15:12:14Z)
Agent-SafetyBench: Evaluating the Safety of LLM Agents [72.92604341646691]
We introduce Agent-SafetyBench, a benchmark designed to evaluate the safety of large language models (LLMs). Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%.
arXiv Detail & Related papers (2024-12-19T02:35:15Z)
SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents [58.65256663334316]
We present SafeAgentBench, the first benchmark for safety-aware task planning of embodied LLM agents in interactive simulation environments. SafeAgentBench includes: (1) an executable, diverse, and high-quality dataset of 750 tasks, rigorously curated to cover 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 9 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives.
arXiv Detail & Related papers (2024-12-17T18:55:58Z)
Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model [73.8765529028288]
We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. To empirically investigate this problem, we developed SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
arXiv Detail & Related papers (2024-06-21T16:14:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.