Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
- URL: http://arxiv.org/abs/2412.14171v1
- Date: Wed, 18 Dec 2024 18:59:54 GMT
- Title: Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
- Authors: Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie
- Abstract summary: We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs.
We find that Multimodal Large Language Models (MLLMs) exhibit competitive - though subhuman - visual-spatial intelligence.
- Abstract: Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive - though subhuman - visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.
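The abstract's final finding, that explicitly generating a cognitive map during question-answering improves spatial-distance performance, can be illustrated with a minimal sketch. The sketch below is an assumption-laden illustration, not the paper's protocol: the 10x10 grid resolution, the JSON map format, the prompt wording, and the helper names are all hypothetical.

```python
# Minimal sketch (illustrative, not the paper's protocol): prompt an MLLM to
# emit a cognitive map as JSON, then read spatial distances off that map.
# The 10x10 grid, the JSON format, and all names here are assumptions.
import json
import math

GRID = 10  # assumed coarse grid resolution for the cognitive map

MAP_PROMPT = (
    "You are watching a walkthrough video of a room. First, build a "
    f"cognitive map: place every named object on a {GRID}x{GRID} grid and "
    'output it as JSON, e.g. {"sofa": [2, 7], "lamp": [5, 1]}. '
    "Then answer the question using that map."
)


def parse_cognitive_map(raw_json: str) -> dict:
    """Parse the model's JSON map into object -> (x, y) grid coordinates."""
    data = json.loads(raw_json)
    return {name: (int(x), int(y)) for name, (x, y) in data.items()}


def grid_distance(cog_map: dict, a: str, b: str) -> float:
    """Euclidean distance between two mapped objects, in grid cells."""
    (x1, y1), (x2, y2) = cog_map[a], cog_map[b]
    return math.hypot(x2 - x1, y2 - y1)


if __name__ == "__main__":
    # Stand-in for an actual MLLM response to MAP_PROMPT.
    fake_response = '{"sofa": [2, 7], "lamp": [5, 1], "table": [3, 6]}'
    cog_map = parse_cognitive_map(fake_response)
    print(round(grid_distance(cog_map, "sofa", "table"), 2))  # ~1.41 grid cells
```

In an actual evaluation, the stand-in response would be replaced by the model's answer to the map-generation prompt over the video frames, and the derived distance would be compared against the benchmark's ground truth.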
Related papers
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [70.74453180101365]
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).
We propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT).
It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.
arXiv Detail & Related papers (2025-01-13T18:23:57Z)
- Does Spatial Cognition Emerge in Frontier Models? [56.47912101304053]
We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models.
Results suggest that contemporary frontier models fall short of the spatial intelligence of animals.
arXiv Detail & Related papers (2024-10-09T01:41:49Z)
- Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models [37.44286562901589]
We propose SpatialEval, a novel benchmark that covers diverse aspects of spatial reasoning.
We conduct a comprehensive evaluation of competitive language and vision-language models.
Our findings reveal several counter-intuitive insights that have been overlooked in the literature.
arXiv Detail & Related papers (2024-06-21T03:53:37Z)
- ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension [71.03445074045092]
We propose ClawMachine, offering a new methodology that explicitly notates each entity using token collectives (groups of visual tokens).
Our method unifies the prompt and answer of visual referential tasks without using additional syntax.
ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency.
arXiv Detail & Related papers (2024-06-17T08:39:16Z)
- TopViewRS: Vision-Language Models as Top-View Spatial Reasoners [38.406430696146714]
Top-view perspective denotes a typical way in which humans read and reason over different types of maps.
We introduce the TopViewRS dataset, consisting of 11,384 multiple-choice questions with either realistic or semantic top-view map as visual input.
We then use it to study and evaluate VLMs across 4 perception and reasoning tasks with different levels of complexity.
arXiv Detail & Related papers (2024-06-04T17:55:43Z)
- Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models [71.93366651585275]
Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks.
We propose Visualization-of-Thought (VoT) to elicit spatial reasoning of LLMs by visualizing their reasoning traces.
VoT significantly enhances the spatial reasoning abilities of LLMs.
arXiv Detail & Related papers (2024-04-04T17:45:08Z)
- Things not Written in Text: Exploring Spatial Commonsense from Visual Signals [77.46233234061758]
We investigate whether models with visual signals learn more spatial commonsense than text-based models.
We propose a benchmark that focuses on the relative scales of objects, and the positional relationship between people and objects under different actions.
We find that image synthesis models are more capable of learning accurate and consistent spatial knowledge than other models.
arXiv Detail & Related papers (2022-03-15T17:02:30Z)