Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
- URL: http://arxiv.org/abs/2505.17015v1
- Date: Thu, 22 May 2025 17:59:39 GMT
- Title: Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
- Authors: Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, Kevin J. Liang
- Abstract summary: Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images. We propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Our model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning.
- Score: 70.41727912081463
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.
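
The abstract does not specify the MultiSPA data schema or the model interface, so the sketch below is only a hypothetical illustration of how a multi-frame spatial QA sample and a simple exact-match evaluation loop might look. Every field name and the `model.generate` call are assumptions for illustration, not the paper's actual dataset format or API.

```python
# Hypothetical sketch: the MultiSPA schema and the Multi-SpatialMLLM
# interface are not described in the abstract, so all names below are assumed.
from dataclasses import dataclass
from typing import List


@dataclass
class MultiFrameSpatialSample:
    """One QA item pairing several frames of a scene with a spatial question."""
    frame_paths: List[str]   # e.g. two views of the same 3D/4D scene
    question: str            # e.g. "How far did the camera move between the frames?"
    answer: str              # ground-truth answer used for scoring
    task: str                # e.g. "depth", "correspondence", "dynamic"


def exact_match_accuracy(model, samples: List[MultiFrameSpatialSample]) -> float:
    """Score a model that answers multi-frame spatial questions in free-form text."""
    correct = 0
    for s in samples:
        # Assumed interface: the model takes several images plus a text prompt
        # and returns an answer string.
        prediction = model.generate(images=s.frame_paths, prompt=s.question)
        correct += int(prediction.strip().lower() == s.answer.strip().lower())
    return correct / max(len(samples), 1)
```

In principle, the same loop could be adapted to the multi-frame reward-annotation use case mentioned in the abstract by replacing the ground-truth comparison with a scalar score produced by the model.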
Related papers
- EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery [15.581788175591097]
It is challenging to adapt natural spatial models to remote sensing imagery. EarthGPT-X offers zoom-in and zoom-out insight and possesses flexible multi-grained interactive abilities. Experiments demonstrate the superiority of the proposed EarthGPT-X in multi-grained tasks.
arXiv Detail & Related papers (2025-04-17T09:56:35Z) - TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models [23.916205754112774]
Multimodal Large Language Models (MLLMs) have shown remarkable versatility in understanding diverse multimodal data and tasks. We propose TAMP, a simple yet effective pruning framework tailored for MLLMs. We validate our method on two state-of-the-art MLLMs: LLaVA-NeXT, designed for vision-language tasks, and VideoLLaMA2, capable of processing audio, visual, and language modalities.
arXiv Detail & Related papers (2025-04-14T05:44:38Z) - PUMA: Empowering Unified MLLM with Multi-granular Visual Generation [62.747751204215916]
We propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation.
PUMA unifies multi-granular visual features as both inputs and outputs of MLLMs.
This work represents a significant step towards a truly unified MLLM capable of adapting to the granularity demands of various visual tasks.
arXiv Detail & Related papers (2024-10-17T17:59:57Z) - A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically sorts out the applications of MLLM in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z) - SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, namely, SEED-X. SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks. We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z) - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for
Multi-modal Large Language Models [86.478087039015]
We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.
Based on our proposed joint mixing, we propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images.
We hope our work may cast a light on the exploration of joint mixing in future MLLM research.
arXiv Detail & Related papers (2023-11-13T18:59:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.