Fugu-MT paper translation (abstract): RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

Paper overview: RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

arxiv url: http://arxiv.org/abs/2606.00828v1
Date: Sat, 30 May 2026 17:55:10 GMT
Status: translation complete
Last updated in system: 2026-06-02 21:34:28.869911
Title: RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes
Title (reference translation): RoboStressBench: A Benchmark of VLM Robustness to Physical Visual Stress in Embodied Scenes
Authors: Leyi Wu, Yifan Zhao, Jinjie Zhang, Suzeyu Chen, Wosong Chen, Zhifei Chen, Tianshuo Xu, Qingchun He, Hongxin Hu, Haojian Huang, Yangkai Wei, Wenqian Li, Yinchuan Li, Ying-Cong Chen
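Abstract summary: Vision-Language Models (VLMs) show strong visual understanding and are increasingly deployed in embodied AI systems. Existing benchmarks evaluate VLMs using clean images or isolated perturbations rather than stresses caused by physical scene formation. The paper introduces RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes.
Reference score (attention score computed independently by this site): 47.11389355477851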
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real conditions is essential. However, existing benchmarks assess VLMs using clean images or isolated perturbations rather than stresses caused by physical scene formation. This design has two limitations: it covers only a narrow subset of everyday visual stresses, and some perturbations rarely appear in realistic embodied scenes. This gap raises a fundamental question: how can we define visual stress in a principled way that captures the diverse factors encountered in physical environments? To address this question, we formulate visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the physical rendering equation, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This design enables RoboStressBench to cover a broad range of visual stresses in real-world environments, while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluations of state-of-the-art VLMs, we identify stress-specific failure modes and reveal that different physical factors degrade different embodied capabilities, which are often obscured by aggregate accuracy. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. Overall, RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress, supporting the development of more reliable embodied AI systems.
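Abstract (reference translation): Vision-Language Models (VLMs) show strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real-world conditions is essential. However, existing benchmarks evaluate VLMs using clean images or isolated perturbations rather than stresses caused by physical scene formation. This design has two limitations: it covers only a narrow portion of everyday visual stresses, and some of its perturbations rarely appear in realistic embodied scenes. This gap raises a fundamental question: how can visual stress be defined in a principled way that captures the diverse factors encountered in physical environments? To address this question, we formulate visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the physical rendering equation, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This design lets RoboStressBench cover a broad range of visual stresses in real-world environments while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluation of state-of-the-art VLMs, we identify stress-specific failure modes and show that different physical factors degrade different embodied capabilities. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress.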
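
Note on the rendering equation: the abstract says the four stress dimensions are inspired by the physical rendering equation but does not state it. For reference, the standard form of that equation is sketched below. The pairing of each term with Material, Viewpoint, Lighting, and Geometry in the comments is an assumption made here for illustration, not a mapping taken from the paper.

```latex
% Standard rendering equation (outgoing radiance at surface point x).
% The M/V/L/G annotations are an illustrative assumption, not the paper's definition.
\[
  L_o(\mathbf{x}, \omega_o)
  = L_e(\mathbf{x}, \omega_o)
  + \int_{\Omega}
      f_r(\mathbf{x}, \omega_i, \omega_o)\,   % BRDF: plausibly Material (M)
      L_i(\mathbf{x}, \omega_i)\,             % incoming radiance: plausibly Lighting (L)
      (\omega_i \cdot \mathbf{n})\,           % surface point and normal: plausibly Geometry (G)
      \mathrm{d}\omega_i
\]
% \omega_o is the outgoing direction toward the camera: plausibly Viewpoint (V).
% L_e is emitted radiance; \Omega is the hemisphere around the normal \mathbf{n}.
```

Read this way, a change to any single term alters the observed image while the others stay fixed, which is one way to interpret the controlled, per-dimension analysis the abstract describes.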

Related papers