FuguReport

RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

Authors: Leyi Wu, Yifan Zhao, Jinjie Zhang, Suzeyu Chen, Wosong Chen, Zhifei Chen, Tianshuo Xu, Qingchun He, Hongxin Hu, Haojian Huang, Yangkai Wei, Wenqian Li, Yinchuan Li, Ying-Cong Chen
Affiliations: The Hong Kong University of Science and Technology / Knowin
Categories: Evaluation / Robustness Evaluation / VLM physical visual stress robustness; Application / Embodied AI / Visual stress in embodied scenes; Task / Benchmarking / RoboStressBench physical stress benchmark
License: CC BY 4.0

Abstract Overview

RoboStressBench is a benchmark that evaluates how robust vision-language models (VLMs) are to physical visual stress in embodied scenes, going beyond evaluation on clean images or synthetic corruptions. The paper formulates visual stress from an image-formation perspective and organizes it into four physically grounded dimensions: Material, Viewpoint, Lighting, and Geometry. The benchmark is built from filtered real cases, controlled stress synthesis, and additional real-world collection, yielding a dataset of approximately 7.2K examples that supports both multiple-choice VQA and grounding tasks. Using this setup, the authors evaluate 16 state-of-the-art VLMs and analyze how different stress factors affect recognition, reasoning, planning, and localization-related behavior.
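
As a rough illustration of how multiple-choice results on a benchmark organized this way could be broken down, the sketch below computes accuracy per stress dimension. The record keys (`dimension`, `answer`, `prediction`) and the toy records are assumptions made for this example; they are not the benchmark's actual schema or data.

```python
from collections import defaultdict

# The four stress dimensions named in the summary above.
DIMENSIONS = ("Material", "Viewpoint", "Lighting", "Geometry")


def accuracy_by_dimension(records):
    """Return {dimension: accuracy} for multiple-choice records.

    Each record is a dict with the hypothetical keys
    'dimension', 'answer', and 'prediction'.
    Dimensions with no records are omitted from the result.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        dim = record["dimension"]
        total[dim] += 1
        if record["prediction"] == record["answer"]:
            correct[dim] += 1
    return {d: correct[d] / total[d] for d in DIMENSIONS if total[d] > 0}


# Toy records, invented purely for illustration.
toy_records = [
    {"dimension": "Material", "answer": "A", "prediction": "A"},
    {"dimension": "Material", "answer": "B", "prediction": "C"},
    {"dimension": "Lighting", "answer": "D", "prediction": "A"},
    {"dimension": "Geometry", "answer": "C", "prediction": "C"},
]

per_dimension = accuracy_by_dimension(toy_records)
print(per_dimension)
```

Reporting accuracy per dimension rather than only overall makes stress-specific weaknesses visible even when the average looks acceptable.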

Novelty

The main novelty is a physically grounded robustness benchmark for VLMs in embodied scenes that defines stress using image-formation factors rather than generic digital perturbations. The work is also distinctive in pairing the benchmark with StressDART, a test-time detect-and-rectify pipeline that uses explicit stress diagnosis before reasoning.

Results

Across the 16 evaluated VLMs, performance under physical visual stress is far from saturated: the best overall accuracy reported is 58.1%, and strong commercial models score 44.8% and 46.2%. The analysis shows that scaling generally improves average performance but does not remove stress-specific weaknesses, with geometry stress being particularly harmful for localization and spatial reasoning tasks. The proposed StressDART improves a Qwen3-VL-4B baseline from 43.2% to 49.0% when the model reasons over both the original and the rectified images.
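
To put the reported StressDART improvement in perspective, the short sketch below computes the absolute gain in percentage points and the relative gain over the baseline. Only the two accuracies (43.2% and 49.0%) come from the summary; the helper name `accuracy_gain` is introduced here for illustration.

```python
def accuracy_gain(baseline, improved):
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = improved - baseline
    relative = absolute / baseline * 100
    return round(absolute, 1), round(relative, 1)


# Accuracies reported for the Qwen3-VL-4B baseline and for StressDART.
absolute_points, relative_percent = accuracy_gain(43.2, 49.0)
print(absolute_points, relative_percent)  # 5.8 13.4
```

That is, the reported change corresponds to 5.8 percentage points, or roughly a 13% relative improvement over the baseline.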

Key Points

  1. RoboStressBench organizes embodied-scene visual stress into four interpretable dimensions (Material, Viewpoint, Lighting, and Geometry) and supports both VQA and grounding evaluation.
  2. The benchmark reveals task-dependent failure modes: for example, geometry-related stress strongly degrades grounding and spatial reasoning, while material and lighting stress more often affect recognition and state understanding.
  3. StressDART provides a parameter-free test-time intervention that detects the dominant stressor and applies targeted rectification, yielding measurable robustness gains over the baseline model.
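
The detect-and-rectify idea described above can be sketched as a small control flow: diagnose the dominant stressor, apply a matching rectifier if one exists, and let the model reason over the original image together with the rectified one. This is a minimal sketch under stated assumptions, not the authors' implementation: the toy image type, the brightness-based detector, the `brighten` rectifier, the 0.2 threshold, and the `vlm` callable are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Tuple

# Toy stand-in for an image: a non-empty flat list of intensities in [0, 1].
Image = List[float]


def detect_stressor(image: Image) -> str:
    """Hypothetical detector: report 'Lighting' for very dark images, else 'None'."""
    mean_intensity = sum(image) / len(image)
    return "Lighting" if mean_intensity < 0.2 else "None"


def brighten(image: Image) -> Image:
    """Hypothetical rectifier: rescale so the brightest pixel becomes 1.0."""
    peak = max(image)
    if peak == 0:
        return list(image)  # fully black image: nothing to rescale
    return [pixel / peak for pixel in image]


# One rectifier per diagnosed stressor; only a lighting example is sketched here.
RECTIFIERS: Dict[str, Callable[[Image], Image]] = {"Lighting": brighten}


def detect_and_rectify(
    image: Image,
    question: str,
    vlm: Callable[[List[Image], str], str],
) -> Tuple[str, str]:
    """Diagnose the stressor, rectify if possible, then query the model.

    The model always receives the original image; the rectified image is
    appended only when a rectifier exists for the diagnosed stressor.
    """
    stressor = detect_stressor(image)
    images = [image]
    rectifier = RECTIFIERS.get(stressor)
    if rectifier is not None:
        images.append(rectifier(image))
    return stressor, vlm(images, question)


def fake_vlm(images: List[Image], question: str) -> str:
    """Placeholder model that only reports how many images it was given."""
    return f"answered using {len(images)} image(s)"


dark_result = detect_and_rectify([0.05, 0.1, 0.1], "Is the drawer open?", fake_vlm)
clear_result = detect_and_rectify([0.5, 0.9, 0.7], "Is the drawer open?", fake_vlm)
print(dark_result)
print(clear_result)
```

Keeping the original image in the input alongside the rectified one mirrors the setting for which the summary reports the 49.0% result, and it avoids relying entirely on a rectification step that may itself introduce errors.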
