Video World Models with Long-term Spatial Memory
- URL: http://arxiv.org/abs/2506.05284v1
- Date: Thu, 05 Jun 2025 17:42:34 GMT
- Title: Video World Models with Long-term Spatial Memory
- Authors: Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein
- Abstract summary: We introduce a novel framework to enhance long-term consistency of video world models. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory. Our evaluations show improved quality, consistency, and context length compared to relevant baselines.
- Score: 110.530715838396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework to enhance long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory, and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.
Related papers
- Beyond Pixel Histories: World Models with Persistent 3D State [50.4601060508243]
PERSIST is a new paradigm of world model which simulates the evolution of a latent 3D scene. We show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods.
arXiv Detail & Related papers (2026-03-03T19:58:31Z) - AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories [78.78355829813793]
Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from the reconstructed geometry in the history. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories. Experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality.
arXiv Detail & Related papers (2026-02-16T17:23:08Z) - Spatia: Video Generation with Updatable Spatial Memory [60.21619361473996]
Spatia is a spatial memory-aware video generation framework that preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
arXiv Detail & Related papers (2025-12-17T18:59:59Z) - VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory [42.2374676860638]
Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally. Maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory.
arXiv Detail & Related papers (2025-12-04T07:06:02Z) - RELIC: Interactive Video World Model with Long-Horizon Memory [74.81433479334821]
A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control. We present RELIC, a unified framework that tackles these three challenges altogether. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
arXiv Detail & Related papers (2025-12-03T18:29:20Z) - WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling [42.52474988220278]
We propose WorldPack, a video world model with efficient compressed memory. WorldPack significantly improves spatial consistency, fidelity, and quality in long-term generation. Our performance is evaluated with LoopNav, a benchmark on Minecraft.
arXiv Detail & Related papers (2025-12-02T07:06:23Z) - Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft [45.363427511806385]
Memory Forcing is a learning framework that pairs training protocols with a geometry-indexed spatial memory. We show that Memory Forcing achieves superior long-term spatial consistency and generative quality across diverse environments.
arXiv Detail & Related papers (2025-10-03T17:35:16Z) - VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory [55.73900731190389]
We introduce Surfel-Indexed View Memory (VMem), a mechanism that remembers past views by indexing them geometrically based on the 3D surface elements they have observed. VMem enables the efficient retrieval of the most relevant past views when generating new ones. We evaluate our approach on challenging long-term scene synthesis benchmarks and demonstrate superior performance compared to existing methods in maintaining scene coherence and camera control.
arXiv Detail & Related papers (2025-06-23T17:59:56Z) - Long-Context State-Space Video World Models [66.28743632951218]
We propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory.
arXiv Detail & Related papers (2025-05-26T16:12:41Z) - Learning 3D Persistent Embodied World Models [84.40585374179037]
We introduce a new persistent embodied world model with an explicit memory of previously generated content. During generation time, our video diffusion model predicts RGB-D video of the future observations of the agent. This generation is then aggregated into a persistent 3D map of the environment.
arXiv Detail & Related papers (2025-05-05T17:59:17Z) - WORLDMEM: Long-term Consistent World Simulation with Memory [20.450750381415965]
WorldMem is a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states. Our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps.
arXiv Detail & Related papers (2025-04-16T17:59:30Z) - FACTS: A Factored State-Space Framework For World Modelling [24.08175276756845]
We propose a novel recurrent framework, the FACTored State-space (FACTS) model, for spatial-temporal world modelling. The FACTS framework constructs a graph-memory with a routing mechanism that learns permutable memory representations. It consistently outperforms or matches specialised state-of-the-art models, despite its general-purpose world modelling design.
arXiv Detail & Related papers (2024-10-28T11:04:42Z) - Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation [53.04781510348416]
Video-based 3D human pose and shape estimations are evaluated by intra-frame accuracy and inter-frame smoothness.
We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, Global-to-Local Transformer (GLoT).
Our GLoT surpasses previous state-of-the-art methods with the lowest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z) - Video Dehazing via a Multi-Range Temporal Alignment Network with Physical Prior [117.6741444489174]
Video dehazing aims to recover haze-free frames with high visibility and contrast.
This paper presents a novel framework to explore the physical haze priors and aggregate temporal information.
We construct the first large-scale outdoor video dehazing benchmark dataset.
arXiv Detail & Related papers (2023-03-17T03:44:17Z) - GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
arXiv Detail & Related papers (2020-12-15T18:58:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.