WORLDMEM: Long-term Consistent World Simulation with Memory
- URL: http://arxiv.org/abs/2504.12369v1
- Date: Wed, 16 Apr 2025 17:59:30 GMT
- Title: WORLDMEM: Long-term Consistent World Simulation with Memory
- Authors: Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, Xingang Pan
- Abstract summary: WorldMem is a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states. Our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps.
- Score: 20.450750381415965
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing a memory attention mechanism that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.
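To make the described mechanism concrete, below is a minimal, illustrative sketch (not the authors' implementation) of a memory bank whose units pair frame features with state embeddings (pose and timestamp), and an attention-style retrieval that scores stored states against the current state before fusing the matching frames. All names, dimensions, and the top-k fusion strategy are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


class MemoryBank:
    """Stores memory units: a latent frame feature plus its state (pose, timestamp)."""

    def __init__(self):
        self.frames = []   # each: (d,) latent feature of an observed frame
        self.states = []   # each: (s,) pose + timestamp embedding

    def add(self, frame_feat: torch.Tensor, state: torch.Tensor) -> None:
        self.frames.append(frame_feat)
        self.states.append(state)

    def as_tensors(self):
        return torch.stack(self.frames), torch.stack(self.states)


def memory_attention(query_state: torch.Tensor,
                     bank: MemoryBank,
                     top_k: int = 4) -> torch.Tensor:
    """Retrieve a state-weighted summary of the most relevant memory frames.

    Relevance is scored here by dot-product similarity between the current
    (pose, timestamp) state and the stored states; the top-k frames are then
    fused with softmax attention weights (a simplification of cross-attention).
    """
    frames, states = bank.as_tensors()              # (N, d), (N, s)
    scores = states @ query_state                   # (N,) state similarity
    k = min(top_k, scores.numel())
    top_scores, idx = scores.topk(k)                # keep the k best matches
    weights = F.softmax(top_scores, dim=0)          # (k,)
    return (weights.unsqueeze(1) * frames[idx]).sum(dim=0)  # (d,) fused context


if __name__ == "__main__":
    bank = MemoryBank()
    for _ in range(16):                             # populate with dummy memories
        bank.add(torch.randn(64), torch.randn(8))
    context = memory_attention(torch.randn(8), bank)
    print(context.shape)                            # torch.Size([64])
```

In the paper's setting, the fused memory context would condition the generator on previously observed content; the dot-product scoring above simply stands in for whatever learned state-based attention the authors use.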
Related papers
- Occupancy Learning with Spatiotemporal Memory [39.41175479685905]
We propose a scene-level occupancy representation learning framework that effectively learns 3D occupancy features with temporal consistency.
Our method significantly enhances the temporal representation learned for 3D occupancy prediction tasks by exploiting the temporal dependency between multi-frame inputs.
arXiv Detail & Related papers (2025-08-06T17:59:52Z)
- GTAD: Global Temporal Aggregation Denoising Learning for 3D Semantic Occupancy Prediction [14.549066678968368]
We propose a global temporal aggregation denoising network named GTAD for holistic 3D scene understanding.
Our method employs an in-model latent denoising network to aggregate local temporal features from the current moment and global temporal features from historical sequences.
arXiv Detail & Related papers (2025-07-28T16:18:29Z)
- Video World Models with Long-term Spatial Memory [110.530715838396]
We introduce a novel framework to enhance long-term consistency of video world models.
Our framework includes mechanisms to store and retrieve information from the long-term spatial memory.
Our evaluations show improved quality, consistency, and context length compared to relevant baselines.
arXiv Detail & Related papers (2025-06-05T17:42:34Z)
- 3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model [83.70640091897947]
Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences.
Current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments.
We propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions.
arXiv Detail & Related papers (2025-05-28T17:59:13Z)
- Long-Context State-Space Video World Models [66.28743632951218]
We propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency.
Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory.
Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory.
arXiv Detail & Related papers (2025-05-26T16:12:41Z)
- Learning 3D Persistent Embodied World Models [84.40585374179037]
We introduce a new persistent embodied world model with an explicit memory of previously generated content.
During generation time, our video diffusion model predicts RGB-D video of the future observations of the agent.
This generation is then aggregated into a persistent 3D map of the environment.
arXiv Detail & Related papers (2025-05-05T17:59:17Z)
- LMPOcc: 3D Semantic Occupancy Prediction Utilizing Long-Term Memory Prior from Historical Traversals [4.970345700893879]
Long-term Memory Prior Occupancy (LMPOcc) is the first 3D occupancy prediction methodology that exploits long-term memory priors derived from historical perceptual outputs.
We introduce a plug-and-play architecture that integrates long-term memory priors to enhance local perception while simultaneously constructing global occupancy representations.
arXiv Detail & Related papers (2025-04-18T09:58:48Z)
- LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs [55.81291976637705]
Large multimodal models (LMMs) excel in scene understanding but struggle with fine-grained spatiotemporal reasoning due to weak alignment between linguistic and visual representations.
Existing methods map textual positions and durations into the visual space from frame-based videos, but suffer from temporal sparsity that limits temporal coordination.
We introduce LLaFEA to leverage event cameras for temporally dense perception and frame-event fusion.
arXiv Detail & Related papers (2025-03-10T05:30:30Z)
- Episodic Memories Generation and Evaluation Benchmark for Large Language Models [7.660368798066376]
We argue that integrating episodic memory capabilities into Large Language Models is essential for advancing AI towards human-like cognition.
We develop a structured approach to represent episodic events, encapsulating temporal and spatial contexts, involved entities, and detailed descriptions.
We synthesize a unique episodic memory benchmark, free from contamination, and release open source code and datasets to assess LLM performance.
arXiv Detail & Related papers (2025-01-21T02:16:13Z)
- FACTS: A Factored State-Space Framework For World Modelling [24.08175276756845]
We propose a novel recurrent framework, the FACTored State-space (FACTS) model, for spatial-temporal world modelling.
The FACTS framework constructs a graph-memory with a routing mechanism that learns permutable memory representations.
It consistently outperforms or matches specialised state-of-the-art models, despite its general-purpose world modelling design.
arXiv Detail & Related papers (2024-10-28T11:04:42Z)
- Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning [64.93848182403116]
Current deep-learning memory models struggle in partially observable, long-term reinforcement learning environments.
We introduce the Stable Hadamard Memory, a novel memory model for reinforcement learning agents.
Our approach significantly outperforms state-of-the-art memory-based methods on challenging partially observable benchmarks.
arXiv Detail & Related papers (2024-10-14T03:50:17Z)
- Spatially-Aware Transformer for Embodied Agents [20.498778205143477]
This paper explores the use of Spatially-Aware Transformer models that incorporate spatial information.
We demonstrate that memory utilization efficiency can be improved, leading to enhanced accuracy in various place-centric downstream tasks.
We also propose the Adaptive Memory Allocator, a memory management method based on reinforcement learning.
arXiv Detail & Related papers (2024-02-23T07:46:30Z)
- Generalizing Event-Based Motion Deblurring in Real-World Scenarios [62.995994797897424]
Event-based motion deblurring has shown promising results by exploiting low-latency events.
We propose a scale-aware network that allows flexible input spatial scales and enables learning from different temporal scales of motion blur.
A two-stage self-supervised learning scheme is then developed to fit real-world data distribution.
arXiv Detail & Related papers (2023-08-11T04:27:29Z)
- Efficient Global-Local Memory for Real-time Instrument Segmentation of Robotic Surgical Video [53.14186293442669]
We identify two important clues for surgical instrument perception, including local temporal dependency from adjacent frames and global semantic correlation in long-range duration.
We propose a novel dual-memory network (DMNet) to relate both global and local temporal knowledge.
Our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z)
- Efficient Regional Memory Network for Video Object Segmentation [56.587541750729045]
We propose a novel local-to-local matching solution for semi-supervised VOS, namely Regional Memory Network (RMNet).
The proposed RMNet effectively alleviates the ambiguity of similar objects in both memory and query frames.
Experimental results indicate that the proposed RMNet performs favorably against state-of-the-art methods on the DAVIS and YouTube-VOS datasets.
arXiv Detail & Related papers (2021-03-24T02:08:46Z)
- HM4: Hidden Markov Model with Memory Management for Visual Place Recognition [54.051025148533554]
We develop a Hidden Markov Model approach for visual place recognition in autonomous driving.
Our algorithm, dubbed HM4, exploits temporal look-ahead to transfer promising candidate images between passive storage and active memory.
We show that this allows constant time and space inference for a fixed coverage area.
arXiv Detail & Related papers (2020-11-01T08:49:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.