WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling
- URL: http://arxiv.org/abs/2512.02473v1
- Date: Tue, 02 Dec 2025 07:06:23 GMT
- Title: WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling
- Authors: Yuta Oshima, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
- Abstract summary: We propose WorldPack, a video world model with efficient compressed memory. WorldPack significantly improves spatial consistency, fidelity, and quality in long-term generation. Its performance is evaluated with LoopNav, a benchmark on Minecraft.
- Score: 42.52474988220278
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally and spatially consistent long-term world modeling has been a long-standing problem, unresolved even by recent state-of-the-art models, due to the prohibitively expensive computational cost of long-context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long-term generation despite a much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval: trajectory packing achieves high context efficiency, and memory retrieval maintains consistency across rollouts and helps long-term generations that require spatial reasoning. Performance is evaluated with LoopNav, a Minecraft benchmark specialized for the evaluation of long-term consistency, and we verify that WorldPack notably outperforms strong state-of-the-art models.
Related papers
- Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory [101.2076718776139]
We propose a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. We introduce a Pose-Free Hierarchical Memory (HPMC) that distills historical latents into a fixed-budget geometric representation. We also propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic.
arXiv Detail & Related papers (2026-02-02T17:52:56Z) - RELIC: Interactive Video World Model with Long-Horizon Memory [74.81433479334821]
A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control. We present RELIC, a unified framework that tackles these three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
arXiv Detail & Related papers (2025-12-03T18:29:20Z) - SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models [42.814012901180774]
SAMPO is a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. We show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks.
arXiv Detail & Related papers (2025-09-19T02:41:37Z) - Video World Models with Long-term Spatial Memory [110.530715838396]
We introduce a novel framework to enhance the long-term consistency of video world models. Our framework includes mechanisms to store and retrieve information from a long-term spatial memory. Our evaluations show improved quality, consistency, and context length compared to relevant baselines.
arXiv Detail & Related papers (2025-06-05T17:42:34Z) - Toward Memory-Aided World Models: Benchmarking via Spatial Consistency [30.871215294419343]
A memory module is a crucial component for addressing spatial consistency. However, no existing datasets are designed to promote the development of memory modules by explicitly enforcing spatial consistency constraints. We construct a dataset and corresponding benchmark by sampling 150 distinct locations within the open-world environment of Minecraft.
arXiv Detail & Related papers (2025-05-29T01:28:57Z) - StateSpaceDiffuser: Bringing Long Context to Diffusion World Models [52.92249035412797]
We introduce StateSpaceDiffuser, in which a diffusion model is enabled to perform long-context tasks by integrating features from a state-space model. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models.
arXiv Detail & Related papers (2025-05-28T11:27:54Z) - Long-Context State-Space Video World Models [66.28743632951218]
We propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory.
arXiv Detail & Related papers (2025-05-26T16:12:41Z) - WORLDMEM: Long-term Consistent World Simulation with Memory [20.450750381415965]
WorldMem is a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states. Our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps.
arXiv Detail & Related papers (2025-04-16T17:59:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.