Related papers: HomeSafeBench: A Benchmark for Embodied Vision-Language Models in Free-Exploration Home Safety Inspection

HomeSafeBench: A Benchmark for Embodied Vision-Language Models in Free-Exploration Home Safety Inspection

URL: http://arxiv.org/abs/2509.23690v1
Date: Sun, 28 Sep 2025 07:01:27 GMT
Title: HomeSafeBench: A Benchmark for Embodied Vision-Language Models in Free-Exploration Home Safety Inspection
Authors: Siyuan Gao, Jiashu Yao, Haoyu Wen, Yuhang Guo, Zeming Liu, Heyan Huang,
Abstract summary: Embodied agents can identify and report safety hazards in the home environments.<n>Existing benchmarks suffer from two key limitations.<n>HomeSafeBench is a benchmark with 12,900 data points covering five common home safety hazards.
Score: 45.2338049870908
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Embodied agents can identify and report safety hazards in the home environments. Accurately evaluating their capabilities in home safety inspection tasks is curcial, but existing benchmarks suffer from two key limitations. First, they oversimplify safety inspection tasks by using textual descriptions of the environment instead of direct visual information, which hinders the accurate evaluation of embodied agents based on Vision-Language Models (VLMs). Second, they use a single, static viewpoint for environmental observation, which restricts the agents' free exploration and cause the omission of certain safety hazards, especially those that are occluded from a fixed viewpoint. To alleviate these issues, we propose HomeSafeBench, a benchmark with 12,900 data points covering five common home safety hazards: fire, electric shock, falling object, trips, and child safety. HomeSafeBench provides dynamic first-person perspective images from simulated home environments, enabling the evaluation of VLM capabilities for home safety inspection. By allowing the embodied agents to freely explore the room, HomeSafeBench provides multiple dynamic perspectives in complex environments for a more thorough inspection. Our comprehensive evaluation of mainstream VLMs on HomeSafeBench reveals that even the best-performing model achieves an F1-score of only 10.23%, demonstrating significant limitations in current VLMs. The models particularly struggle with identifying safety hazards and selecting effective exploration strategies. We hope HomeSafeBench will provide valuable reference and support for future research related to home security inspections. Our dataset and code will be publicly available soon.

Related papers

SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond [134.43113804188195]
We introduce SafeSci, a comprehensive framework for safety evaluation and enhancement in scientific contexts.<n>SafeSci comprises SafeSciBench, a multi-disciplinary benchmark with 0.25M samples, and SafeSciTrain, a large-scale dataset containing 1.5M samples for safety enhancement.
arXiv Detail & Related papers (2026-03-02T08:16:04Z)
Automated Safety Benchmarking: A Multi-agent Pipeline for LVLMs [61.01470415470677]
Large vision-language models (LVLMs) exhibit remarkable capabilities in cross-modal tasks but face significant safety challenges.<n>Existing benchmarks are hindered by their labor-intensive construction process, static complexity, and limited discriminative power.<n>We propose VLSafetyBencher, the first automated system for LVLM safety benchmarking.
arXiv Detail & Related papers (2026-01-27T11:51:30Z)
Is Your VLM for Autonomous Driving Safety-Ready? A Comprehensive Benchmark for Evaluating External and In-Cabin Risks [24.48209914161689]
Vision-Language Models (VLMs) show great promise for autonomous driving, but their suitability for safety-critical scenarios is largely unexplored.<n>This issue arises from the lack of comprehensive benchmarks that assess both external environmental risks and in-cabin driving behavior safety simultaneously.<n>We introduce DSBench, the first comprehensive Driving Safety Benchmark to assess a VLM's awareness of various safety risks in a unified manner.
arXiv Detail & Related papers (2025-11-18T15:33:49Z)
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows [77.95511352806261]
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms.<n>We propose OS-Sentinel, a novel hybrid safety detection framework that combines a Formal Verifier for detecting explicit system-level violations with a Contextual Judge for assessing contextual risks and agent actions.
arXiv Detail & Related papers (2025-10-28T13:22:39Z)
iSafetyBench: A video-language benchmark for safety in industrial environment [6.697702130929693]
iSafetyBench is a new video-language benchmark designed to evaluate model performance in industrial environments.<n>iSafetyBench comprises 1,100 video clips sourced from real-world industrial settings.<n>We evaluate eight state-of-the-art video-language models under zero-shot conditions.
arXiv Detail & Related papers (2025-08-01T07:55:53Z)
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model [52.72318433518926]
Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content.<n>We introduce a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations.<n>We propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head.
arXiv Detail & Related papers (2025-06-05T07:26:34Z)
Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model [29.63418384788804]
We conduct a safety evaluation of 11 Multimodal Large Reasoning Models (MLRMs) across 5 benchmarks.<n>Our analysis reveals distinct safety patterns across different benchmarks.<n>It is a potential approach to address safety issues in MLRMs by leveraging the intrinsic reasoning capabilities of the model to detect unsafe intent.
arXiv Detail & Related papers (2025-05-10T06:59:36Z)
Using Vision Language Models for Safety Hazard Identification in Construction [1.2343292905447238]
We propose and experimentally validated a Vision Language Model (VLM)-based framework for the identification of construction hazards.<n>We evaluate state-of-the-art VLMs, including GPT-4o, Gemini, Llama 3.2, and InternVL2, using a custom dataset of 1100 construction site images.
arXiv Detail & Related papers (2025-04-12T05:11:23Z)
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs [78.99703366417661]
Large language models (LLMs) increasingly assist in tasks ranging from procedural guidance to autonomous experiment orchestration.<n>Such overreliance is particularly dangerous in high-stakes laboratory settings, where failures in hazard identification or risk assessment can result in severe accidents.<n>We propose the Laboratory Safety Benchmark (LabSafety Bench) to evaluate models on their ability to identify potential hazards, assess risks, and predict the consequences of unsafe actions in lab environments.
arXiv Detail & Related papers (2024-10-18T05:21:05Z)
Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety.<n>For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context.<n>We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
arXiv Detail & Related papers (2024-10-08T16:16:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.