MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence
- URL: http://arxiv.org/abs/2603.00515v1
- Date: Sat, 28 Feb 2026 07:23:36 GMT
- Title: MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence
- Authors: Xingyilang Yin, Chengzhengxu Li, Jiahao Chang, Chi-Man Pun, Xiaodong Cun,
- Abstract summary: Humans are born with vision-based 4D spatial-temporal intelligence. Despite its importance, this capability remains a significant bottleneck for current multimodal large language models (MLLMs).
- Score: 50.11889361459544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans are born with vision-based 4D spatial-temporal intelligence, which enables us to perceive and reason about the evolution of 3D space over time from purely visual inputs. Despite its importance, this capability remains a significant bottleneck for current multimodal large language models (MLLMs). To tackle this challenge, we introduce MLLM-4D, a comprehensive framework designed to bridge the gaps in training data curation and model post-training for spatiotemporal understanding and reasoning. On the data front, we develop a cost-efficient data curation pipeline that repurposes existing stereo video datasets into high-quality 4D spatiotemporal instructional data. This results in the MLLM4D-2M and MLLM4D-R1-30k datasets for Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), alongside MLLM4D-Bench for comprehensive evaluation. Regarding model training, our post-training strategy establishes a foundational 4D understanding via SFT and further catalyzes 4D reasoning capabilities by employing Group Relative Policy Optimization (GRPO) with specialized Spatiotemporal Chain of Thought (ST-CoT) prompting and Spatiotemporal reward functions (ST-reward), without any architectural modification. Extensive experiments demonstrate that MLLM-4D achieves state-of-the-art spatial-temporal understanding and reasoning capabilities from purely 2D RGB inputs. Project page: https://github.com/GVCLab/MLLM-4D.
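The RFT stage above uses Group Relative Policy Optimization, which samples a group of responses per prompt, scores each with a reward function, and normalizes rewards within the group to form advantages instead of training a value model. A minimal sketch of that group-relative advantage step, assuming illustrative reward values (the function name and the example spatiotemporal reward are not from the paper):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-response rewards within one prompt's sample group.

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation: GRPO's baseline-free
    alternative to a learned value function.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical ST-reward-style scores for 4 sampled responses:
# 1.0 for a correct answer plus a small format bonus when the chain
# of thought is spatiotemporally grounded.
advantages = group_relative_advantages([1.2, 1.0, 0.0, 0.0])
```

The normalized advantages then weight the policy-gradient update for each response's tokens; correct, well-grounded responses get positive advantage and the rest negative, summing to roughly zero per group.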
Related papers
- Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models [79.18306680174011]
DSR Suite bridges the gap across dataset, benchmark, and model. We propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. The pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories.
arXiv Detail & Related papers (2025-12-23T17:56:36Z) - S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance [20.55536735670125]
3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. We propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning.
arXiv Detail & Related papers (2025-12-01T03:08:34Z) - Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z) - B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding [23.446113957661503]
We introduce B4DL, a benchmark specifically designed for training and evaluating MLLMs on 4D LiDAR understanding. We propose a scalable data generation pipeline and an MLLM model that, for the first time, directly processes raw 4D LiDAR by bridging it with language understanding.
arXiv Detail & Related papers (2025-08-07T11:11:56Z) - Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence [13.168559963356952]
We present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Our key insight is to unleash the strong structure prior of a feed-forward visual geometry foundation model. A connector then integrates both features into unified visual tokens for enhanced spatial understanding.
arXiv Detail & Related papers (2025-05-29T17:59:04Z) - VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction [86.82819259860186]
We introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding.
arXiv Detail & Related papers (2025-05-26T17:56:30Z) - 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding [83.37551035659119]
There are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects. We introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding.
arXiv Detail & Related papers (2025-03-22T17:55:53Z) - Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints.
We demonstrate that this simple training-free approach brings substantial gains to GPT4-V/O consistently on four benchmarks.
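The Coarse Correspondences idea above hinges on identifying a few "primary" objects whose tracks persist across frames, then marking each with a stable label so the MLLM can match the same object between views. A minimal sketch of that selection-and-labeling step, assuming track IDs already produced by some off-the-shelf tracker (function names and the data layout are illustrative, not from the paper):

```python
from collections import Counter

def select_primary_tracks(frame_tracks, top_k=3):
    """Rank object track IDs by how many frames they appear in and
    keep the top_k most persistent ones; these are the 'primary'
    objects whose correspondences get marked across frames.

    frame_tracks: per-frame lists of tracked object IDs.
    """
    counts = Counter(tid for frame in frame_tracks for tid in frame)
    return [tid for tid, _ in counts.most_common(top_k)]

def correspondence_labels(frame_tracks, top_k=3):
    """Assign a stable numeric label to each primary track, so the
    same object carries the same visual mark in every frame where
    it appears (the labels would then be drawn onto the frames)."""
    primary = select_primary_tracks(frame_tracks, top_k)
    label = {tid: i + 1 for i, tid in enumerate(primary)}
    return [{tid: label[tid] for tid in frame if tid in label}
            for frame in frame_tracks]

# Four frames; object "a" persists throughout, "b" and "c" partially.
frames = [["a", "b"], ["a", "b", "c"], ["a", "c"], ["a"]]
labels = correspondence_labels(frames, top_k=2)
```

Because the approach only annotates the input frames, it is training-free: the labeled frames are handed to an unchanged MLLM, which can then refer to "object 1" consistently across time and viewpoints.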
arXiv Detail & Related papers (2024-08-01T17:57:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.