Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?
- URL: http://arxiv.org/abs/2602.07055v1
- Date: Wed, 04 Feb 2026 19:06:40 GMT
- Title: Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?
- Authors: Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Manling Li
- Abstract summary: Theory of Space is defined as an agent's ability to actively acquire information through self-directed, active exploration. A key innovation is spatial belief probing, which prompts models to reveal their internal spatial representations at each step. Our findings suggest that current foundation models struggle to maintain coherent, revisable spatial beliefs during active exploration.
- Score: 83.13508919229939
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatial embodied intelligence requires agents to act to acquire information under partial observability. While multimodal foundation models excel at passive perception, their capacity for active, self-directed exploration remains understudied. We propose Theory of Space, defined as an agent's ability to actively acquire information through self-directed, active exploration and to construct, revise, and exploit a spatial belief from sequential, partial observations. We evaluate this through a benchmark where the goal is curiosity-driven exploration to build an accurate cognitive map. A key innovation is spatial belief probing, which prompts models to reveal their internal spatial representations at each step. Our evaluation of state-of-the-art models reveals several critical bottlenecks. First, we identify an Active-Passive Gap, where performance drops significantly when agents must autonomously gather information. Second, we find high inefficiency, as models explore unsystematically compared to program-based proxies. Through belief probing, we diagnose that while perception is an initial bottleneck, global beliefs suffer from instability that causes spatial knowledge to degrade over time. Finally, using a false belief paradigm, we uncover Belief Inertia, where agents fail to update obsolete priors with new evidence. This issue is present in text-based agents but is particularly severe in vision-based models. Our findings suggest that current foundation models struggle to maintain coherent, revisable spatial beliefs during active exploration.
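To make the probing protocol concrete, below is a minimal, hypothetical sketch of the per-step "spatial belief probing" the abstract describes: after each self-directed move, the agent is asked to externalize its current belief about the map, and that belief is scored against ground truth. All names (`agent_step`, `probe_belief`, `score_belief`) and the toy grid are illustrative assumptions, not the paper's actual benchmark interface.

```python
# Hypothetical sketch of spatial belief probing during active exploration.
# A real setup would prompt a foundation model and parse its free-text
# map description; here stubs stand in for the model.
from typing import Dict, Tuple

Cell = Tuple[int, int]

GROUND_TRUTH: Dict[Cell, str] = {
    (0, 0): "start", (0, 1): "wall", (1, 0): "key", (1, 1): "door",
}

def agent_step(step: int) -> Cell:
    """Stand-in for the model's self-directed exploration policy."""
    order = [(0, 0), (1, 0), (0, 1), (1, 1)]
    return order[step % len(order)]

def probe_belief(observed: Dict[Cell, str]) -> Dict[Cell, str]:
    """Stand-in for prompting the model to verbalize its spatial belief."""
    return dict(observed)  # an ideal agent reports exactly what it saw

def score_belief(belief: Dict[Cell, str]) -> float:
    """Fraction of ground-truth cells the belief labels correctly."""
    correct = sum(belief.get(c) == v for c, v in GROUND_TRUTH.items())
    return correct / len(GROUND_TRUTH)

observed: Dict[Cell, str] = {}
for t in range(4):
    cell = agent_step(t)                 # active, self-directed move
    observed[cell] = GROUND_TRUTH[cell]  # partial observation of one cell
    belief = probe_belief(observed)      # per-step belief probe
    print(f"step {t}: belief accuracy = {score_belief(belief):.2f}")
```

Tracking the probed accuracy over steps is what exposes the instability the abstract reports: a coherent agent's curve should be monotone, while a degrading belief shows earlier cells being mislabeled later.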
Related papers
- Spatial Causal Prediction in Video [56.22332198384257]
We introduce a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We construct a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding.
arXiv Detail & Related papers (2026-03-04T11:09:39Z) - Temporal Representations for Exploration: Learning Complex Exploratory Behavior without Extrinsic Rewards [39.328230174948025]
We propose an exploration method that leverages temporal contrastive representations to guide exploration. We demonstrate that such representations can enable the learning of complex exploratory behavior in locomotion, manipulation, and embodied-AI tasks.
arXiv Detail & Related papers (2026-03-02T15:55:27Z) - Exploration Through Introspection: A Self-Aware Reward Model [0.0]
Evidence points to a unified system for self- and other-awareness. We explore this self-awareness by having reinforcement learning agents infer their own internal states in gridworld environments.
arXiv Detail & Related papers (2026-01-06T19:53:33Z) - EscherVerse: An Open World Benchmark and Dataset for Teleo-Spatial Intelligence with Physical-Dynamic and Intent-Driven Understanding [56.89359230139883]
We introduce Teleo-Spatial Intelligence (TSI), a new paradigm that unifies two critical pillars: Physical-Dynamic Reasoning and Intent-Driven Reasoning. We present EscherVerse, consisting of a large-scale, open-world benchmark (Escher-Bench), a dataset (Escher-35k), and models (the Escher series). It is the first benchmark to systematically assess Intent-Driven Reasoning, challenging models to connect physical events to their underlying human purposes.
arXiv Detail & Related papers (2026-01-04T14:42:39Z) - Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks [108.15756345836901]
We provide a comprehensive review of multimodal spatial reasoning tasks with large models. We review advances in embodied AI, including vision-language navigation and action models. We consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors.
arXiv Detail & Related papers (2025-10-29T17:55:43Z) - How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective [103.44502230776352]
We present a systematic investigation of Visual Spatial Reasoning (VSR) in Vision-Language Models (VLMs). We categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings.
arXiv Detail & Related papers (2025-09-23T12:00:14Z) - Fostering Intrinsic Motivation in Reinforcement Learning with Pretrained Foundation Models [8.255197802529118]
The recent rise of foundation models, such as CLIP, offers an opportunity to leverage pretrained, semantically rich embeddings.
Intrinsic motivation modules can effectively utilize full state information, significantly increasing sample efficiency.
We show that embeddings provided by foundation models are sometimes even better than those constructed by the agent during training.
arXiv Detail & Related papers (2024-10-09T20:05:45Z) - H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions [62.510951695174604]
"Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR) is a probabilistic generative framework that generates hypotheses about how objects articulate given input observations.
We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework.
We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
arXiv Detail & Related papers (2022-10-22T18:39:33Z)
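To illustrate the loop that H-SAUR's name spells out, here is a minimal toy sketch: maintain hypotheses over an object's joint type, pick the action whose simulated outcomes best discriminate the hypotheses, act, and update beliefs from the observed motion. The joint types, stand-in simulator, and Gaussian-likelihood update are assumptions for illustration only, not the paper's probabilistic generative model.

```python
# Toy Hypothesize-Simulate-Act-Update-Repeat loop for inferring how an
# articulated object (e.g., a door) moves: revolute vs. prismatic joint.
import math
import random

TRUE_JOINT = "revolute"  # hidden articulation type the agent must infer
ACTIONS = ["push_edge", "pull_straight"]
NOISE = 0.05             # assumed observation-noise scale

def simulate(hypothesis: str, action: str) -> float:
    """Displacement predicted under a hypothesis (toy stand-in physics)."""
    if hypothesis == "revolute":
        return 1.0 if action == "push_edge" else 0.1
    return 1.0 if action == "pull_straight" else 0.1

def act(action: str) -> float:
    """Displacement observed on the real (toy) object, with sensor noise."""
    return simulate(TRUE_JOINT, action) + random.gauss(0.0, NOISE)

beliefs = {"revolute": 0.5, "prismatic": 0.5}            # hypothesize
for _ in range(5):                                       # repeat
    # choose the action whose simulated outcomes differ most across hypotheses
    action = max(ACTIONS, key=lambda a: abs(
        simulate("revolute", a) - simulate("prismatic", a)))
    obs = act(action)                                    # act
    for h in beliefs:                                    # update, scoring each
        err = obs - simulate(h, action)                  # hypothesis by how well
        beliefs[h] *= math.exp(-err * err / (2 * NOISE ** 2))  # it simulates obs
    total = sum(beliefs.values())
    beliefs = {h: p / total for h, p in beliefs.items()}

print("inferred joint:", max(beliefs, key=beliefs.get), beliefs)
```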