From Perception to Action: An Interactive Benchmark for Vision Reasoning
- URL: http://arxiv.org/abs/2602.21015v1
- Date: Tue, 24 Feb 2026 15:33:02 GMT
- Title: From Perception to Action: An Interactive Benchmark for Vision Reasoning
- Authors: Yuhao Wu, Maojia Song, Yihuai Lan, Lei Wang, Zhiqiang Hu, Yao Xiao, Heng Zhou, Weihua Zheng, Dylan Raharja, Soujanya Poria, Roy Ka-Wei Lee
- Abstract summary: The Causal Hierarchy of Actions and Interactions (CHAIN) benchmark is designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. Our results show that top-performing models still struggle to internalize physical structure and causal constraints: they often fail to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions.
- Score: 51.11355591375073
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints: they often fail to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.
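To make the interactive setup concrete, here is a minimal sketch of the kind of multi-turn perceive-plan-act loop the abstract describes. All names (`ChainEnv`, `VLMAgent`, `propose_action`) are hypothetical stand-ins for illustration, not the benchmark's published API.

```python
# Hypothetical sketch of an interactive, physics-driven evaluation loop
# of the kind CHAIN describes. The environment and agent interfaces are
# illustrative stand-ins, not the benchmark's actual code.

class VLMAgent:
    def propose_action(self, observation, history):
        """Ask a vision-language model for the next action, given the
        current rendered view and the interaction history so far."""
        raise NotImplementedError  # e.g., wrap a call to a hosted VLM

def run_episode(env, agent, max_steps=50):
    """Roll out one interactive episode: perceive, act, observe, repeat."""
    observation = env.reset()          # initial rendered view of the 3D scene
    history = []
    for _ in range(max_steps):
        action = agent.propose_action(observation, history)
        observation, done, info = env.step(action)  # physics applies the action
        history.append((action, info))              # keep multi-turn context
        if done:                                    # puzzle solved / box packed
            return True, history
    return False, history                           # out of step budget
```

The point of such a loop, as opposed to single-turn VQA, is that each action changes the scene, so errors in perceived structure compound over the horizon.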
Related papers
- Interaction-aware Representation Modeling with Co-occurrence Consistency for Egocentric Hand-Object Parsing [20.40288070674112]
We propose an end-to-end Interaction-aware Transformer (InterFormer). It integrates three key components: a Dynamic Query Generator (DQG), a Dual-context Feature Selector (DFS), and the Conditional Co-occurrence (CoCo) loss. Our model achieves state-of-the-art performance on both the EgoHOS and the challenging out-of-distribution mini-HOI4D datasets.
arXiv Detail & Related papers (2026-02-24T06:39:18Z)
- Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models [57.71440995598757]
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models. Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world.
arXiv Detail & Related papers (2025-12-15T18:03:42Z)
- PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention [92.85371254435074]
The PosA-VLA framework anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. We show that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks.
arXiv Detail & Related papers (2025-12-03T12:14:29Z)
- Video Spatial Reasoning with Object-Centric 3D Rollout [58.12446467377404]
We propose Object-Centric 3D Rollout (OCR) to enable robust video spatial reasoning. OCR introduces structured perturbations to the 3D geometry of selected objects during training, compelling the model to reason holistically across the entire scene.
arXiv Detail & Related papers (2025-11-17T09:53:41Z)
- Action-Dynamics Modeling and Cross-Temporal Interaction for Online Action Understanding [23.87664450145037]
Action understanding, encompassing action detection and anticipation, plays a crucial role in numerous practical applications. We propose a novel framework called the State-Specific Model (SSM), designed to unify and enhance both action detection and anticipation tasks.
arXiv Detail & Related papers (2025-10-12T16:10:40Z)
- PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly [77.33429729761596]
We introduce PhyBlock, a progressive benchmark to assess vision-language models (VLMs) on physical understanding and planning. PhyBlock integrates a novel four-level cognitive-hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples. We benchmark 21 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning.
arXiv Detail & Related papers (2025-06-10T11:46:06Z)
- Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [89.77871049500546]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
arXiv Detail & Related papers (2025-06-04T07:36:33Z)
- Collaborative Learning for 3D Hand-Object Reconstruction and Compositional Action Recognition from Egocentric RGB Videos Using Superquadrics [31.819336585007104]
We propose to leverage superquadrics as an alternative 3D object representation to bounding boxes, and demonstrate their effectiveness on both template-free object reconstruction and action recognition tasks. We also study the compositionality of actions by considering a more challenging task where the training combinations of verbs and nouns do not overlap with the testing split. (The standard superquadric surface equation is sketched after this list.)
arXiv Detail & Related papers (2025-01-13T07:26:05Z)
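The superquadric representation used in the last entry above has a standard closed form (Barr, 1981): a point lies inside the shape exactly when the inside-outside function evaluates below 1. The sketch below is a generic NumPy implementation for illustration, not the paper's code.

```python
import numpy as np

def superquadric_inside_outside(p, a, eps):
    """Standard superquadric inside-outside function F (Barr, 1981).
    F < 1: point inside the surface; F = 1: on it; F > 1: outside.

    p   : (..., 3) array of query points in the superquadric's frame
    a   : (a1, a2, a3) semi-axis lengths
    eps : (eps1, eps2) shape exponents (both 1 recovers an ellipsoid)
    """
    x, y, z = np.moveaxis(np.abs(p), -1, 0)
    a1, a2, a3 = a
    e1, e2 = eps
    xy = (x / a1) ** (2.0 / e2) + (y / a2) ** (2.0 / e2)
    return xy ** (e2 / e1) + (z / a3) ** (2.0 / e1)

# Example: a box-like superquadric (small exponents sharpen the edges)
F = superquadric_inside_outside(np.array([0.5, 0.5, 0.5]), (1, 1, 1), (0.1, 0.1))
print(F < 1.0)  # True: the point lies inside
```

With eps1 = eps2 = 1 this reduces to an ellipsoid, while exponents near zero approach a box, which is why superquadrics can stand in for bounding boxes while remaining a smooth, parametric surface.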