Fugu-MT paper translation (abstract): HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

Paper overview: HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

arxiv url: http://arxiv.org/abs/2603.11975v1
Date: Thu, 12 Mar 2026 14:25:44 GMT
Status: translation complete
Last updated in system: 2026-03-13 14:46:26.141344
Title: HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
Title (reference translation): HomeSafe-Bench: Evaluating Vision-Language Models for Unsafe Action Detection in Household Scenarios
Authors: Jiayue Pu, Zhongxiang Sun, Zilu Zhang, Xiao Zhang, Jun Xu
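Abstract summary: We introduce HomeSafe-Bench, a benchmark for evaluating Vision-Language Models (VLMs) on unsafe action detection in household scenarios. We also propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring.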
Reference score (independently computed attention score): 10.375753259643
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.
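Abstract (reference translation): The rapid evolution of embodied agents has accelerated the deployment of household robots in real environments. Unlike structured industrial settings, however, household spaces carry unpredictable safety risks, so system limitations such as perception latency and a lack of common-sense knowledge can cause dangerous errors. Current safety evaluations are often limited to static images, text, or general hazards, and cannot adequately benchmark dynamic unsafe action detection in these specific contexts. To fill this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed through a hybrid pipeline that combines physical simulation with advanced video generation, and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency and detection accuracy. Evaluations show that HD-Guard achieves a superior trade-off between latency and performance, while the analysis identifies critical bottlenecks in current VLM-based safety detection.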
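
The fast/slow split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Frame` type, the `risk_hint` field, both scoring functions, and the thresholds 0.5 and 0.8 are hypothetical placeholders chosen only to show the control flow, in which a cheap screen runs on every frame and an expensive check is scheduled asynchronously for suspicious frames so the stream is never blocked.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Frame:
    index: int
    risk_hint: float  # hypothetical stand-in for visual content, in [0, 1]


def fast_brain(frame: Frame) -> float:
    """Cheap per-frame screen (placeholder for a lightweight model)."""
    return frame.risk_hint


async def slow_brain(frame: Frame) -> bool:
    """Expensive check (placeholder for a large VLM call)."""
    await asyncio.sleep(0.01)  # simulate slow inference
    return frame.risk_hint >= 0.8  # assumed decision threshold


async def monitor(frames, escalate_at=0.5):
    """Screen every frame; escalate suspicious ones without blocking the stream."""
    pending = []
    for frame in frames:
        if fast_brain(frame) >= escalate_at:
            # Schedule the slow check and keep screening later frames.
            task = asyncio.create_task(slow_brain(frame))
            pending.append((frame.index, task))
        await asyncio.sleep(0)  # yield so scheduled slow checks can progress
    # Collect verdicts once the stream ends.
    verdicts = [(idx, await task) for idx, task in pending]
    return [idx for idx, unsafe in verdicts if unsafe]


def run_demo():
    hints = [0.1, 0.6, 0.9, 0.2, 0.85]
    frames = [Frame(i, h) for i, h in enumerate(hints)]
    return asyncio.run(monitor(frames))
```

In this toy run, frames 1, 2, and 4 pass the fast screen and are escalated, and the slow check flags frames 2 and 4. A real system would replace both placeholders with model calls and would likely act on each slow verdict as soon as it arrives rather than at the end of the stream.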
