Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes
- URL: http://arxiv.org/abs/2509.06266v2
- Date: Tue, 30 Sep 2025 04:28:17 GMT
- Title: Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes
- Authors: Mohsen Gholami, Ahmad Rezaei, Zhou Weimin, Sitong Mao, Shunbo Zhou, Yong Zhang, Mohammad Akbari
- Abstract summary: Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). We introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. We propose Ego3D-VLM, a post-training framework that enhances 3D spatial reasoning of VLMs.
- Score: 14.268621981134293
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this issue by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents such as robots and self-driving cars typically rely on ego-centric, multi-view observations. To this end, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 SOTA VLMs, including GPT-4o, Gemini1.5-Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable gap between human scores and VLM performance, highlighting that current VLMs still fall short of human-level spatial understanding. To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances the 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, resulting in a 12% average improvement on multi-choice QA and a 56% average improvement on absolute distance estimation. Ego3D-VLM is modular and can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level spatial understanding in real-world, multi-view environments.
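The abstract only sketches how Ego3D-VLM's cognitive map is built. As a rough illustration (not the paper's actual code; the function name, grid resolution, and text format below are assumptions), one plausible way to serialize estimated global 3D object coordinates into a textual map that can be prepended to a VLM prompt is:

```python
# Hypothetical sketch: turning estimated global 3D object coordinates into a
# coarse top-down "cognitive map" serialized as text for a VLM prompt.
# Function name, grid resolution, and text format are illustrative assumptions.
from typing import Dict, Tuple

def build_cognitive_map(
    objects: Dict[str, Tuple[float, float, float]],  # name -> (x, y, z) in meters, ego at origin
    cell_size: float = 5.0,                          # meters per grid cell
    grid_radius: int = 4,                            # cells on each side of the ego agent
) -> str:
    """Quantize estimated 3D coordinates onto an ego-centred ground-plane grid
    and serialize the result as text."""
    lines = [f"Ego-centred map (1 cell = {cell_size:.0f} m), ego at cell (0, 0):"]
    for name, (x, y, z) in objects.items():
        col, row = round(x / cell_size), round(y / cell_size)  # drop height for a top-down view
        dist = (x ** 2 + y ** 2 + z ** 2) ** 0.5
        if abs(col) <= grid_radius and abs(row) <= grid_radius:
            lines.append(f"- {name}: cell ({col}, {row}), approx. {dist:.1f} m away")
        else:
            lines.append(f"- {name}: beyond the mapped area, approx. {dist:.1f} m away")
    return "\n".join(lines)

# Example with made-up detections aggregated from multiple camera views:
print(build_cognitive_map({"pedestrian": (3.2, 7.5, 0.0), "truck": (-12.0, 18.4, 0.0)}))
```

Serializing coordinates as plain text keeps the component model-agnostic, which is consistent with the abstract's claim that Ego3D-VLM is modular and can be integrated with any existing VLM.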
Related papers
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z) - MindJourney: Test-Time Scaling with World Models for Spatial Reasoning [82.46482433335535]
Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. We propose MindJourney, a test-time scaling framework that equips a vision-language model with this missing capability. We show that MindJourney achieves an average performance boost of over 8% on the representative spatial reasoning benchmark SAT.
arXiv Detail & Related papers (2025-07-16T17:59:36Z) - EgoVLM: Policy Optimization for Egocentric Video Understanding [2.397572703240721]
We introduce EgoVLM, a vision-language model specifically designed to integrate visual comprehension and spatial-temporal reasoning. EgoVLM is fine-tuned via Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align model outputs with human-like reasoning steps (a minimal sketch of GRPO's group-relative advantage appears after this list). Our EgoVLM, trained exclusively on non-CoT egocentric data, outperforms the base Qwen2.5-VL 3B and 7B models by 14.33 and 13.87 accuracy points on the Ego benchmark, respectively.
arXiv Detail & Related papers (2025-06-03T17:28:00Z) - Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames [17.975173937253494]
An embodied AI assistant operating on egocentric video must integrate spatial cues across time. Disjoint-3DQA is a generative QA benchmark that evaluates this ability of VLMs.
arXiv Detail & Related papers (2025-05-30T06:32:26Z) - ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [47.237216851265316]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation.
arXiv Detail & Related papers (2025-05-27T17:59:26Z) - VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction [86.82819259860186]
We introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding.
arXiv Detail & Related papers (2025-05-26T17:56:30Z) - MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA [72.47344411599322]
This research aims to explore building a multimodal foundation model for egocentric video understanding. We automatically generate 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long in Ego4D, based on human-annotated data. We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths.
arXiv Detail & Related papers (2024-10-09T17:59:59Z) - AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding [44.79843213164787]
Embodied AI personal assistants require embodied understanding to collaborate with humans effectively.
Current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric experience.
We introduce the Egocentric Video Understanding dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos.
We present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD.
arXiv Detail & Related papers (2024-06-19T20:14:14Z) - SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities [59.39858959066982]
Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics.
We develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images.
By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA.
arXiv Detail & Related papers (2024-01-22T18:01:01Z)
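The EgoVLM entry above mentions fine-tuning with Group Relative Policy Optimization (GRPO). As a minimal, generic sketch of the group-relative advantage at the core of GRPO (not EgoVLM's actual training code; the reward values below are made up):

```python
# Generic sketch of the group-relative advantage used in GRPO-style training.
# Not tied to EgoVLM's implementation; rewards here are illustrative only.
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """For a group of responses sampled for the same prompt, score each response
    by standardizing its reward against the group mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled answers to one spatial question, rewarded for correctness
# plus a small format bonus; above-average answers receive positive advantage.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.1]))
```

Standardizing rewards within a group of responses to the same prompt removes the need for a separately learned value function, which is the main design choice distinguishing GRPO from PPO.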