LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models
- URL: http://arxiv.org/abs/2511.19261v1
- Date: Mon, 24 Nov 2025 16:13:26 GMT
- Title: LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models
- Authors: Shuai Wang, Daoan Zhang, Tianyi Bai, Shitong Shao, Jiebo Luo, Jiaheng Wei,
- Abstract summary: We propose LAST, short for LeArn to Think in Space and Time, to improve 3D spatial and long video understanding for general vision-language models. We show that LAST brings substantial gains on various benchmarks, including 3 spatial understanding, 4 video understanding, and 3 image understanding tasks.
- Score: 50.50563228383038
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans can perceive and understand 3D space and long videos from sequential visual observations. But can vision-language models (VLMs) do the same? Recent work demonstrates that even state-of-the-art VLMs still struggle to understand 3D space and long videos, although they are powerful in typical vision-language tasks. Current methods often rely on specialized architectural designs to improve performance on 3D tasks and video understanding tasks separately. In contrast, we propose LAST, short for LeArn to Think in Space and Time, to jointly improve 3D spatial and long video understanding for general VLMs with only a set of 2D images as inputs. LAST makes VLMs think in space and time rather than only in text before giving the final answer, building visual thinking trajectories in 3D space and the temporal dimension. We demonstrate the effectiveness of LAST in two scenarios: 1) zero-shot, where we directly prompt proprietary models; and 2) fine-tuning general VLMs with data that include thinking trajectories in 3D space and time. We show that LAST brings substantial gains on various benchmarks, including 3 spatial understanding, 4 video understanding, and 3 image understanding tasks. Notably, it yields 15.8% gains on EgoSchema with GPT-4o in a zero-shot manner and gains of 8.3 on VSI-Bench compared with Qwen2.5-VL-7B.
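As a rough illustration of the zero-shot scenario described in the abstract, the sketch below prompts GPT-4o with sampled 2D frames and an instruction to reason about spatial layout and temporal order before answering. The instruction wording, frame paths, and helper names are assumptions for illustration only; the paper's actual prompts and thinking-trajectory format are not reproduced here.

```python
# Minimal sketch of the zero-shot scenario: prompt a proprietary VLM (here GPT-4o)
# with sampled video frames and an instruction to think in space and time before
# answering. The prompt text is a hypothetical stand-in, not the paper's prompt.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

THINK_IN_SPACE_AND_TIME = (
    "Before answering, reason step by step about the 3D spatial layout of the scene "
    "and the temporal order of events across the frames. Then give the final answer."
)

def encode_frame(path: str) -> str:
    """Read an image file and return a base64 data URL for the API."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def ask_about_video(frame_paths: list[str], question: str) -> str:
    """Send sampled 2D frames plus the question, prefixed with the hypothetical
    spatio-temporal thinking instruction, and return the model's answer."""
    content = [{"type": "text", "text": f"{THINK_IN_SPACE_AND_TIME}\n\nQuestion: {question}"}]
    content += [
        {"type": "image_url", "image_url": {"url": encode_frame(p)}} for p in frame_paths
    ]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

# Example usage (frame paths are placeholders):
# print(ask_about_video(["frame_000.jpg", "frame_010.jpg"],
#                       "Which room did the person enter last?"))
```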
Related papers
- Think3D: Thinking with Space for Spatial Reasoning [54.518667686880114]
We introduce Think3D, a framework that enables vision-language models (VLMs) to think with 3D space. Without additional training, Think3D significantly improves the spatial reasoning performance of advanced models. Our findings demonstrate that training-free, tool-augmented spatial exploration is a viable path toward more flexible and human-like 3D reasoning in multimodal agents.
arXiv Detail & Related papers (2026-01-19T13:13:54Z) - G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning [36.62798449863548]
Vision-Language Models (VLMs) still lack robustness in spatial intelligence. We present G$^2$VLM, a vision-language model that bridges two fundamental aspects of spatial intelligence.
arXiv Detail & Related papers (2025-11-26T18:59:39Z) - Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z) - MindJourney: Test-Time Scaling with World Models for Spatial Reasoning [97.61985090279961]
We propose MindJourney, a test-time scaling framework for vision-language models. We show that MindJourney achieves an average performance boost of over 7.7% on the representative spatial reasoning benchmark SAT. Our method also improves upon test-time inference VLMs trained through reinforcement learning.
arXiv Detail & Related papers (2025-07-16T17:59:36Z) - Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors [24.261272070476934]
Previous research has investigated the application of Multimodal Large Language Models (MLLMs) in understanding 3D scenes by interpreting them as videos. We propose a novel and efficient method called the Video-3D Geometry Large Language Model (VG LLM). Our approach utilizes a 3D visual geometry encoder to extract 3D prior information from video sequences.
arXiv Detail & Related papers (2025-05-30T14:16:41Z) - VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction [86.82819259860186]
We introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding.
arXiv Detail & Related papers (2025-05-26T17:56:30Z) - GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models [39.488763757826426]
2D Vision-Language Models (VLMs) have made significant strides in image-text understanding tasks. Recent advances have leveraged 3D point clouds and multi-view images as inputs, yielding promising results. We propose a vision-based solution inspired by human perception, which relies solely on visual cues for 3D spatial understanding.
arXiv Detail & Related papers (2025-01-02T18:59:59Z) - Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints.
We demonstrate that this simple training-free approach brings substantial gains to GPT4-V/O consistently on four benchmarks (a minimal sketch of the tracking-and-marking idea appears after this list).
arXiv Detail & Related papers (2024-08-01T17:57:12Z) - Multi-View Transformer for 3D Visual Grounding [64.30493173825234]
We propose a Multi-View Transformer (MVT) for 3D visual grounding.
We project the 3D scene to a multi-view space, in which the position information of the 3D scene under different views is modeled simultaneously and aggregated together.
arXiv Detail & Related papers (2022-04-05T12:59:43Z)
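The Coarse Correspondences entry above describes a concrete pipeline: a lightweight tracker links the same objects across frames, and consistent marks are drawn on those objects before the plain 2D frames are handed to the MLLM. Below is a minimal sketch of that idea under stated assumptions; `track_objects` is a hypothetical placeholder for an off-the-shelf tracker, and the overlay style is illustrative rather than the authors' implementation.

```python
# Minimal sketch of the Coarse-Correspondences idea: a lightweight tracker links
# the same object across sampled frames, and each tracked object is overlaid with
# a consistent numeric mark before the frames are sent to a multimodal LLM.
from PIL import Image, ImageDraw

def track_objects(frames: list[Image.Image]) -> list[dict[int, tuple[int, int, int, int]]]:
    """Placeholder tracker: a real implementation would run a lightweight video
    tracking model and return, per frame, {object_id: (x0, y0, x1, y1)}.
    Here we return one dummy box per frame so the sketch runs end to end."""
    return [{0: (10, 10, 60, 60)} for _ in frames]

def overlay_marks(frames: list[Image.Image]) -> list[Image.Image]:
    """Draw the same numeric ID on every appearance of a tracked object, so the
    downstream VLM can associate objects across frames or viewpoints."""
    marked = []
    for frame, boxes in zip(frames, track_objects(frames)):
        canvas = frame.copy()
        draw = ImageDraw.Draw(canvas)
        for obj_id, (x0, y0, x1, y1) in boxes.items():
            draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
            draw.text((x0 + 4, y0 + 4), str(obj_id), fill="red")
        marked.append(canvas)
    return marked

# Example usage (paths are placeholders). The marked frames would then be passed,
# together with the question, to GPT4-V/O or another MLLM as plain 2D images.
# frames = [Image.open(p) for p in ["view_0.jpg", "view_1.jpg"]]
# marked_frames = overlay_marks(frames)
```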