EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence
- URL: http://arxiv.org/abs/2512.15160v1
- Date: Wed, 17 Dec 2025 07:51:36 GMT
- Title: EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence
- Authors: Jiaxu Wan, Xu Wang, Mengwei Xie, Hang Zhang, Mu Xu, Yang Han, Hong Zhang, Ding Yuan, Yifan Yang,
- Abstract summary: spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or MLLMs with black-box reconstruction modules.<n>We present EagleVision, a framework for progressive spatial cognition through macro perception and micro verification.
- Score: 10.889641815961133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for "thinking with images" (e.g., ChatGPT-o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present EagleVision, a dual-stage framework for progressive spatial cognition through macro perception and micro verification. In the macro perception stage, EagleVision employs a semantics-perspective-fusion determinantal point process (SPF-DPP) to select a compact set of geometry- and semantics-aware keyframes from long videos under a fixed token budget. In the micro verification stage, we formalize spatial CoT as BEV-grounded pose querying: the agent iteratively predicts poses on a BEV plane, retrieves the nearest real frames, and is trained purely by reinforcement learning with a spatial grounding reward that scores the consistency between predicted poses and observed views. On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision-language models, demonstrating strong and generalizable spatial understanding.
Related papers
- Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation [41.434638833165494]
Allocentric Perceiver is a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts.<n>Allocentric Perceriver offloads mental rotation from implicit reasoning to explicit computation.
arXiv Detail & Related papers (2026-02-05T15:45:39Z) - Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation [52.605647992080485]
spatial reasoning advances vision-language models from visual perception toward semantic understanding.<n>We integrate the cognitive concept of an object-centric blueprint into spatial reasoning.<n>Our method consistently outperforms existing vision-language models.
arXiv Detail & Related papers (2026-01-05T10:38:26Z) - CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning [48.36177110428022]
We present a central-peripheral vision-inspired framework (CVP) for spatial reasoning.<n>CVP draws inspiration from the two types of human visual fields -- central vision and peripheral vision.<n> Experiments show that CVP achieves state-of-the-art performance across a range of 3D scene understanding benchmarks.
arXiv Detail & Related papers (2025-12-09T00:21:13Z) - Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens [54.18057944158818]
Chain-of-Visual-Thought (COVT) is a framework that enables Vision-Language Models (VLMs) to reason through continuous visual tokens.<n>Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts.<n>During training, the VLM with COVT autoregressively predicts visual tokens to reconstruct dense supervision signals.
arXiv Detail & Related papers (2025-11-24T18:55:19Z) - Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding.<n>We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLM.<n>Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z) - See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model [33.18304419115947]
We introduce SEE&TREK, the first training-free prompting framework to enhance the spatial understanding of Multimodal Large Language Models (MLLM) under vision-only constraints.<n>We focus on increasing visual diversity and motion reconstruction.<n>Our method is training&GPU-free, requiring only a single forward pass, and can be seamlessly integrated into existing MLLMS.
arXiv Detail & Related papers (2025-09-19T15:30:26Z) - Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence [13.168559963356952]
We present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations.<n>Our key insight is to unleash the strong structure prior to the feed-forward visual geometry foundation model.<n>A connector then integrates both features into unified visual tokens for enhanced spatial understanding.
arXiv Detail & Related papers (2025-05-29T17:59:04Z) - ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content.<n>Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints.<n>We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation.
arXiv Detail & Related papers (2025-05-27T17:59:26Z) - VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction [86.82819259860186]
We introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning.<n>VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding.
arXiv Detail & Related papers (2025-05-26T17:56:30Z) - REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models [67.55362046790512]
Vision-language models lack the ability to correctly reason over spatial relationships.
We develop the REVISION framework which improves spatial fidelity in vision-language models.
Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware models.
arXiv Detail & Related papers (2024-08-05T04:51:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.