RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes
Abstract Overview
RoboStressBench is a benchmark for evaluating the robustness of vision-language models (VLMs) to physical visual stress in embodied scenes, going beyond clean images and synthetic corruptions. The paper formulates visual stress from an image-formation perspective and organizes it into four physically grounded dimensions: Material, Viewpoint, Lighting, and Geometry. The benchmark is built from filtered real cases, controlled stress synthesis, and additional real-world collection, yielding a dataset of approximately 7.2K examples that supports both multiple-choice VQA and grounding tasks. Using this setup, the authors evaluate 16 state-of-the-art VLMs and analyze how different stress factors affect recognition, reasoning, planning, and localization-related behavior.
Novelty
The main novelty is a physically grounded robustness benchmark for VLMs in embodied scenes that defines stress using image-formation factors rather than generic digital perturbations. The work is also distinctive in pairing the benchmark with StressDART, a test-time detect-and-rectify pipeline that uses explicit stress diagnosis before reasoning.
Results
Across 16 evaluated VLMs, performance remains unsaturated under physical visual stress: the best overall accuracy reported is 58.1%, and strong commercial models score 44.8% and 46.2%. The analysis shows that scaling generally improves average performance but does not remove stress-specific weaknesses, with geometry stress particularly harmful for localization and spatial reasoning tasks. The proposed StressDART improves a Qwen3-VL-4B baseline from 43.2% to 49.0%, a gain of 5.8 percentage points, when reasoning over both the original and rectified images.
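The overall and stress-specific accuracies discussed above could be tallied roughly as follows. This is a minimal sketch for multiple-choice items only; the record layout (`stress`, `answer`, `prediction`) and the function name `accuracy_by_stress` are assumptions made for illustration, since this summary does not give the benchmark's actual schema or scoring script.

```python
from collections import defaultdict

# Hypothetical records: the real RoboStressBench schema is not described here.
records = [
    {"stress": "Material", "answer": "B", "prediction": "B"},
    {"stress": "Viewpoint", "answer": "A", "prediction": "C"},
    {"stress": "Lighting", "answer": "D", "prediction": "D"},
    {"stress": "Geometry", "answer": "A", "prediction": "B"},
    {"stress": "Geometry", "answer": "C", "prediction": "C"},
]


def accuracy_by_stress(records):
    """Return (overall accuracy, {stress dimension: accuracy})."""
    if not records:
        raise ValueError("no records to score")
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["stress"]] += 1
        # Exact match on the chosen option letter counts as correct.
        correct[r["stress"]] += int(r["prediction"] == r["answer"])
    per_stress = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_stress


overall, per_stress = accuracy_by_stress(records)
print(f"overall: {overall:.1%}")
for stress, acc in per_stress.items():
    print(f"{stress}: {acc:.1%}")
```

Reporting the per-dimension numbers next to the average is what makes a stress-specific weakness visible even when the overall score improves.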
Key Points
- RoboStressBench organizes embodied-scene visual stress into four interpretable dimensions—Material, Viewpoint, Lighting, and Geometry—and supports both VQA and grounding evaluation.
- The benchmark reveals task-dependent failure modes: for example, geometry-related stress strongly degrades grounding and spatial reasoning, while material and lighting stress more often affect recognition and state understanding.
- StressDART provides a parameter-free test-time intervention that detects the dominant stressor and applies targeted rectification, raising Qwen3-VL-4B accuracy from 43.2% to 49.0%.
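The detect-and-rectify idea can be illustrated with a toy control flow: score each stress dimension, pick the dominant one, apply a matching rectifier, and hand both the original and the rectified image to the model. This is only a sketch under strong assumptions. The brightness heuristic, the gain-based rectifier, the placeholder detectors, and every name below are hypothetical; the summary does not describe how StressDART actually diagnoses or corrects stress.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]  # toy grayscale image, pixel values in [0, 1]


def mean_brightness(img: Image) -> float:
    pixels = [p for row in img for p in row]
    return sum(pixels) / len(pixels)


def lighting_score(img: Image) -> float:
    # Hypothetical heuristic: distance of the mean brightness from mid-gray.
    return abs(mean_brightness(img) - 0.5)


def no_signal(img: Image) -> float:
    # Placeholder detector for dimensions this toy example does not model.
    return 0.0


def rectify_lighting(img: Image) -> Image:
    # Hypothetical rectifier: rescale toward a mean of 0.5, clipped at 1.0.
    gain = 0.5 / max(mean_brightness(img), 1e-6)
    return [[min(1.0, p * gain) for p in row] for row in img]


def identity(img: Image) -> Image:
    return [row[:] for row in img]


DETECTORS: Dict[str, Callable[[Image], float]] = {
    "Material": no_signal,
    "Viewpoint": no_signal,
    "Lighting": lighting_score,
    "Geometry": no_signal,
}
RECTIFIERS: Dict[str, Callable[[Image], Image]] = {
    "Material": identity,
    "Viewpoint": identity,
    "Lighting": rectify_lighting,
    "Geometry": identity,
}


def detect_and_rectify(img: Image) -> Tuple[str, List[Image]]:
    """Return the dominant stressor and the [original, rectified] views."""
    stressor = max(DETECTORS, key=lambda name: DETECTORS[name](img))
    return stressor, [img, RECTIFIERS[stressor](img)]


dark = [[0.05, 0.15], [0.10, 0.10]]
stressor, views = detect_and_rectify(dark)
print(stressor, round(mean_brightness(views[1]), 3))
```

Returning both views mirrors the reported setting in which the model reasons over the original plus the rectified image; a real system would pass `views` to the VLM together with the question.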