Toward Memory-Aided World Models: Benchmarking via Spatial Consistency
- URL: http://arxiv.org/abs/2505.22976v1
- Date: Thu, 29 May 2025 01:28:57 GMT
- Title: Toward Memory-Aided World Models: Benchmarking via Spatial Consistency
- Authors: Kewei Lian, Shaofei Cai, Yilun Du, Yitao Liang
- Abstract summary: A memory module is a crucial component for addressing spatial consistency. There are no datasets designed to promote the development of memory modules by explicitly enforcing spatial consistency constraints. We construct a dataset and corresponding benchmark by sampling 150 distinct locations within the open-world environment of Minecraft.
- Score: 30.871215294419343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to simulate the world in a spatially consistent manner is a crucial requirement for effective world models. Such a model enables high-quality visual generation and ensures the reliability of world models for downstream tasks such as simulation and planning. A memory module is a crucial component for achieving spatial consistency: such a module must not only retain long-horizon observational information, but also enable the construction of explicit or implicit internal spatial representations. However, no dataset has been designed to promote the development of memory modules by explicitly enforcing spatial consistency constraints. Furthermore, most existing benchmarks primarily emphasize visual coherence or generation quality, neglecting the requirement of long-range spatial consistency. To bridge this gap, we construct a dataset and corresponding benchmark by sampling 150 distinct locations within the open-world environment of Minecraft, collecting about 250 hours (20 million frames) of loop-based navigation videos with actions. Our dataset follows a curriculum design of sequence lengths, allowing models to learn spatial consistency on increasingly complex navigation trajectories. Furthermore, our data collection pipeline is easily extensible to new Minecraft environments and modules. Four representative world model baselines are evaluated on our benchmark. Dataset, benchmark, and code are open-sourced to support future research.
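The curriculum design of sequence lengths described in the abstract could, for instance, be realized by scheduling progressively longer training windows sampled from the loop-based navigation videos. A minimal sketch, assuming a staged schedule; the stage values, function names, and parameters below are illustrative assumptions, not taken from the paper:

```python
import random

def curriculum_lengths(stages=(128, 512, 2048, 8192), epochs_per_stage=2):
    """Yield a maximum clip length per training epoch, increasing stage by stage.

    The stage boundaries are hypothetical; a real curriculum would tune them
    to the navigation-trajectory complexity of the dataset.
    """
    for max_len in stages:
        for _ in range(epochs_per_stage):
            yield max_len

def sample_clip(video_len, max_len, rng=random):
    """Pick a random window of at most max_len frames from one video.

    Returns (start, end) frame indices, half-open.
    """
    clip_len = min(video_len, max_len)
    start = rng.randrange(video_len - clip_len + 1)
    return start, start + clip_len
```

Early epochs then see short, easy sub-trajectories, while later epochs see near-full loops that stress long-range spatial consistency.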
Related papers
- TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch. We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z) - Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory [101.2076718776139]
We propose a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. We introduce a Pose-free Memory (HPMC) that distills historical latents into a fixed-budget geometric representation. We also propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic.
arXiv Detail & Related papers (2026-02-02T17:52:56Z) - A multi-view contrastive learning framework for spatial embeddings in risk modelling [0.688204255655161]
Spatial data are often unstructured, high-dimensional, and difficult to integrate into predictive models. We propose a novel multi-view contrastive learning framework for generating spatial embeddings. In a case study on French real estate prices, we compare models trained on raw coordinates against those using our spatial embeddings as inputs.
arXiv Detail & Related papers (2025-11-22T07:39:34Z) - CubeletWorld: A New Abstraction for Scalable 3D Modeling [9.828459640939004]
We introduce CubeletWorld, a framework for representing and analyzing urban environments through a discretized 3D grid of spatial units called cubelets. This abstraction enables privacy-preserving modeling by embedding diverse data signals, such as infrastructure, movement, or environmental indicators, into localized cubelet states. We propose the CubeletWorld State Prediction task, which involves predicting the cubelet state using a realistic dataset containing various urban elements.
arXiv Detail & Related papers (2025-11-21T00:19:02Z) - SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding [64.86119288520419]
Multimodal language models struggle with spatial reasoning across time and space. We present SIMS-V, a systematic data-generation framework that leverages the privileged information of 3D simulators. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
arXiv Detail & Related papers (2025-11-06T18:53:31Z) - Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities. Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark. We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z) - TARDIS STRIDE: A Spatio-Temporal Road Image Dataset and World Model for Autonomy [44.85881816317044]
We show how to permute 360-degree panoramic imagery into rich interconnected observation, state, and action nodes. We benchmark this dataset via TARDIS, a transformer-based generative world model. We demonstrate robust performance across a range of agentic tasks such as controllable image synthesis, instruction following, autonomous self-control, and state-of-the-art georeferencing.
arXiv Detail & Related papers (2025-06-12T21:08:11Z) - Video World Models with Long-term Spatial Memory [110.530715838396]
We introduce a novel framework to enhance long-term consistency of video world models. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory. Our evaluations show improved quality, consistency, and context length compared to relevant baselines.
arXiv Detail & Related papers (2025-06-05T17:42:34Z) - Linear Spatial World Models Emerge in Large Language Models [4.9185678564997355]
We investigate whether large language models implicitly encode linear spatial world models. We introduce a formal framework for spatial world models and assess whether such structure emerges in contextual embeddings. Our results provide empirical evidence that LLMs encode linear spatial world models.
arXiv Detail & Related papers (2025-06-03T15:31:00Z) - RemoteSAM: Towards Segment Anything for Earth Observation [29.707796048411705]
We aim to develop a robust yet flexible visual foundation model for Earth observation. It should possess strong capabilities in recognizing and localizing diverse visual targets. We present RemoteSAM, a foundation model that establishes new SoTA on several Earth observation perception benchmarks.
arXiv Detail & Related papers (2025-05-23T15:27:57Z) - Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control [97.98560001760126]
We introduce Cosmos-Transfer1, a conditional world generation model that can generate world simulations based on multiple spatial control inputs. We conduct evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment.
arXiv Detail & Related papers (2025-03-18T17:57:54Z) - Trajectory World Models for Heterogeneous Environments [67.27233466954814]
Heterogeneity in sensors and actuators across environments poses a significant challenge to building large-scale pre-trained world models. We introduce UniTraj, a unified dataset comprising over one million trajectories from 80 environments, designed to scale data while preserving critical diversity. We also propose TrajWorld, a novel architecture capable of handling varying sensor and actuator information and capturing environment dynamics in-context.
arXiv Detail & Related papers (2025-02-03T13:59:08Z) - EDELINE: Enhancing Memory in Diffusion-based World Models via Linear-Time Sequence Modeling [8.250616459360684]
We introduce EDELINE, a unified world model architecture that integrates state space models with diffusion models. Our approach outperforms existing baselines across visually challenging Atari 100k tasks, a memory-demanding benchmark, and 3D first-person ViZDoom environments.
arXiv Detail & Related papers (2025-02-01T15:49:59Z) - FACTS: A Factored State-Space Framework For World Modelling [24.08175276756845]
We propose a novel recurrent framework, the FACTored State-space (FACTS) model, for spatial-temporal world modelling. The FACTS framework constructs a graph-memory with a routing mechanism that learns permutable memory representations. It consistently outperforms or matches specialised state-of-the-art models, despite its general-purpose world modelling design.
arXiv Detail & Related papers (2024-10-28T11:04:42Z) - REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models [67.55362046790512]
Vision-language models lack the ability to correctly reason over spatial relationships.
We develop the REVISION framework which improves spatial fidelity in vision-language models.
Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware models.
arXiv Detail & Related papers (2024-08-05T04:51:46Z) - Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching [60.645802236700035]
Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets.
We introduce GeoText-1652, a new natural language-guided geo-localization benchmark.
This dataset is systematically constructed through an interactive human-computer process.
arXiv Detail & Related papers (2023-11-21T17:52:30Z) - A Simple Framework for Multi-mode Spatial-Temporal Data Modeling [4.855443906457102]
We propose a simple framework for multi-mode spatial-temporal data modeling to bring both effectiveness and efficiency together.
Specifically, we design a general cross-mode spatial relationships learning component to adaptively establish connections between multiple modes.
Experiments on three real-world datasets show that our model can consistently outperform the baselines with lower space and time complexity.
arXiv Detail & Related papers (2023-08-22T05:41:20Z) - Global Aggregation then Local Distribution for Scene Parsing [99.1095068574454]
We show that our approach can be modularized as an end-to-end trainable block and easily plugged into existing semantic segmentation networks.
Our approach allows us to build new state of the art on major semantic segmentation benchmarks including Cityscapes, ADE20K, Pascal Context, Camvid and COCO-stuff.
arXiv Detail & Related papers (2021-07-28T03:46:57Z) - S2RMs: Spatially Structured Recurrent Modules [105.0377129434636]
We take a step towards exploiting dynamic structures that are capable of simultaneously exploiting both modular and spatio-temporal structures.
We find our models to be robust to the number of available views and better capable of generalization to novel tasks without additional training.
arXiv Detail & Related papers (2020-07-13T17:44:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here (including all generated summaries) and is not responsible for any consequences of its use.