REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories
- URL: http://arxiv.org/abs/2512.00736v1
- Date: Sun, 30 Nov 2025 05:20:22 GMT
- Title: REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories
- Authors: Jacob Thompson, Emiliano Garcia-Lopez, Yonatan Bisk
- Abstract summary: We introduce REM (Reasoning over Embodied Multi-Frame Trajectories), a benchmark using controllable 3D environments for embodied spatial reasoning. REM systematically evaluates key aspects like object permanence/distinction, spatial relationships, and numerical tracking across dynamic embodied viewpoints. Our evaluation shows that the best-performing current models exhibit promising overall performance, but become increasingly unreliable at even moderate complexity levels easily handled by humans.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans build viewpoint-independent cognitive maps through navigation, enabling intuitive reasoning about object permanence and spatial relations. We argue that multimodal large language models (MLLMs), despite extensive video training, lack this fundamental spatial reasoning capability, a critical limitation for embodied applications. To demonstrate these limitations and drive research, we introduce REM (Reasoning over Embodied Multi-Frame Trajectories), a benchmark using controllable 3D environments for long-horizon embodied spatial reasoning. REM systematically evaluates key aspects like object permanence/distinction, spatial relationships, and numerical tracking across dynamic embodied viewpoints. Our evaluation shows that the best-performing current models exhibit promising overall performance, but become increasingly unreliable at even moderate complexity levels easily handled by humans. These findings highlight challenges MLLMs face in developing robust spatial representations from sequential visual input. Consequently, REM provides targeted metrics and diagnostics to foster improved spatial understanding in future models.
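
To make the evaluation protocol concrete, below is a minimal sketch of how a multi-frame trajectory benchmark of this kind might be scored: each item pairs an ordered sequence of egocentric frames with a question and a ground-truth answer derived from the controllable 3D environment, and accuracy is reported per capability category. Everything here (`TrajectoryItem`, the `query_mllm` callable, the field and category names) is a hypothetical illustration assumed for the example; REM's actual data format and harness are not described in the abstract.

```python
# Hypothetical scoring loop for a multi-frame spatial-reasoning benchmark
# in the spirit of REM. All names and the data layout are illustrative
# assumptions, not REM's actual interface.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TrajectoryItem:
    frames: List[str]   # ordered frame images from one embodied trajectory
    question: str       # e.g. "How many red cubes has the agent seen so far?"
    answer: str         # ground-truth label from the controllable 3D environment
    category: str       # e.g. "object_permanence", "counting", "relations"


def evaluate(items: List[TrajectoryItem],
             query_mllm: Callable[[List[str], str], str]) -> Dict[str, float]:
    """Return per-category accuracy; query_mllm maps (frames, question) -> answer."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for item in items:
        pred = query_mllm(item.frames, item.question)
        total[item.category] = total.get(item.category, 0) + 1
        if pred.strip().lower() == item.answer.strip().lower():
            correct[item.category] = correct.get(item.category, 0) + 1
    return {cat: correct.get(cat, 0) / n for cat, n in total.items()}


if __name__ == "__main__":
    # Stub model that always answers "2"; replace with a real MLLM call.
    items = [
        TrajectoryItem(["f0.png", "f1.png", "f2.png"],
                       "How many chairs has the agent passed?", "2", "counting"),
    ]
    print(evaluate(items, lambda frames, q: "2"))  # {'counting': 1.0}
```

Reporting accuracy per category rather than as a single aggregate matches the abstract's emphasis on targeted metrics and diagnostics for distinct capabilities such as object permanence and numerical tracking.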
Related papers
- SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery [64.67498968405327]
SpatialDreamer is a reinforcement learning framework that enables spatial reasoning through a closed-loop process of active exploration. GeoPO introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints.
arXiv Detail & Related papers (2025-12-08T17:20:50Z) - LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigating model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z) - Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks [108.15756345836901]
We provide a comprehensive review of multimodal spatial reasoning tasks with large models. We review advances in embodied AI, including vision-language navigation and action models. We consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors.
arXiv Detail & Related papers (2025-10-29T17:55:43Z) - How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective [103.44502230776352]
We present a systematic investigation of Visual Spatial Reasoning (VSR) in Vision-Language Models (VLMs). We categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings.
arXiv Detail & Related papers (2025-09-23T12:00:14Z) - SpatialViz-Bench: An MLLM Benchmark for Spatial Visualization [44.427830927596204]
SpatialViz-Bench is a comprehensive benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs reveals wide performance variations and uncovers counter-intuitive findings.
arXiv Detail & Related papers (2025-07-10T10:27:20Z) - ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation.
arXiv Detail & Related papers (2025-05-27T17:59:26Z) - SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z) - Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning. We then assess the performance of these models in both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z) - SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation [7.659514491338669]
Current vision-language models may grasp basic spatial cues but struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications. We develop SPHERE, a hierarchical evaluation framework supported by a new human-annotated dataset. Benchmark evaluation of state-of-the-art models reveals significant deficiencies, especially in reasoning about distance and proximity.
arXiv Detail & Related papers (2024-12-17T09:10:55Z) - GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs [3.2688425993442696]
The ability to understand and reason about spatial relationships between objects in images is an important component of visual reasoning.
We extend the previously released What'sUp dataset and propose a novel comprehensive evaluation for spatial relationship understanding.
arXiv Detail & Related papers (2024-06-19T06:15:26Z)