RELIC: Interactive Video World Model with Long-Horizon Memory
- URL: http://arxiv.org/abs/2512.04040v1
- Date: Wed, 03 Dec 2025 18:29:20 GMT
- Title: RELIC: Interactive Video World Model with Long-Horizon Memory
- Authors: Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, Kalyan Sunkavalli, Feng Liu, Zhengqi Li, Hao Tan
- Abstract summary: A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control. We present RELIC, a unified framework that tackles these three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
- Score: 74.81433479334821
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles these three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, we fine-tune a bidirectional teacher video model to generate sequences beyond its original 5-second training horizon, and transform it into a causal student generator using a new memory-efficient self-forcing paradigm that enables full-context distillation over long-duration teacher rollouts as well as long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated Unreal Engine-rendered dataset, RELIC achieves real-time generation at 16 FPS while demonstrating more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.
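A minimal sketch of how the camera-aware memory described above could be laid out, assuming hypothetical names throughout (MemoryEntry, CameraMemory, write, as_kv_context): each chunk of history is compressed to a small token budget and stored with the absolute camera pose and relative action of that step, then concatenated into a single attention context. The paper's actual token compression, pose/action encoding, and KV-cache layout are not specified here.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MemoryEntry:
    tokens: np.ndarray   # compressed historical latent tokens for one chunk, (k, d)
    pose: np.ndarray     # absolute camera pose, 4x4 world-from-camera
    action: np.ndarray   # relative action taken over this chunk

@dataclass
class CameraMemory:
    entries: list = field(default_factory=list)

    def write(self, latents, pose, action, budget=16):
        # Stand-in compression: keep only `budget` tokens per chunk so the
        # memory grows slowly even over long horizons.
        self.entries.append(MemoryEntry(latents[:budget], pose, action))

    def as_kv_context(self):
        # Concatenate all stored tokens into one attention context.
        return np.concatenate([e.tokens for e in self.entries], axis=0)

mem = CameraMemory()
mem.write(np.random.randn(64, 128), np.eye(4), np.array([0.0, 0.1]))
ctx = mem.as_kv_context()   # shape (16, 128)
```

In the real model the pose and action would presumably be injected as embeddings so that attention over this context can perform the implicit 3D-consistent retrieval the abstract describes; the naive truncation above is only a stand-in for the learned compression.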
Related papers
- Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory [101.2076718776139]
We propose a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments.
We introduce a Pose-free Memory (HPMC) that distills historical latents into a fixed-budget geometric representation.
We also propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic.
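As a rough illustration of the tri-state discretization this summary mentions, a sketch under assumed names and thresholds (tri_state_label and the band width are illustrative, not the paper's actual Uncertainty-aware Action Labeling module):

```python
import numpy as np

def tri_state_label(motion, threshold=0.05):
    """Map continuous per-axis motion to {-1, 0, +1}; magnitudes inside
    the threshold band are treated as uncertain / no-op."""
    labels = np.zeros(motion.shape, dtype=int)
    labels[motion > threshold] = 1
    labels[motion < -threshold] = -1
    return labels

tri_state_label(np.array([0.2, -0.01, -0.3]))   # -> [1, 0, -1]
```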
arXiv Detail & Related papers (2026-02-02T17:52:56Z)
- Spatia: Video Generation with Updatable Spatial Memory [60.21619361473996]
Spatia is a spatial memory-aware video generation framework that preserves a 3D scene point cloud as persistent spatial memory.
Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM.
Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
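The generate/update loop described above might look like the following sketch; generate_clip and slam_update are placeholder callables standing in for Spatia's generator and visual-SLAM module, whose real interfaces the summary does not give.

```python
def explore(generate_clip, slam_update, memory_points, actions):
    """Generate clips conditioned on a persistent point-cloud memory and
    extend that memory with visual SLAM after every clip."""
    clips = []
    for action in actions:
        clip = generate_clip(memory_points, action)        # condition on memory
        memory_points = slam_update(memory_points, clip)   # register new frames
        clips.append(clip)
    return clips, memory_points
```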
arXiv Detail & Related papers (2025-12-17T18:59:59Z)
- WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling [34.486078065308995]
WorldPlay is a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency.
We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs.
We also propose Context Forcing, a novel distillation method designed for memory-aware models.
arXiv Detail & Related papers (2025-12-16T17:22:46Z)
- VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory [42.2374676860638]
Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally.
Maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition.
We propose VideoSSM, a long-video model that unifies AR diffusion with a hybrid state-space memory.
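A toy linear state-space update shows why such a hybrid memory suits minute-scale horizons: the recurrent state is a fixed-size summary of the whole history, so per-frame cost and memory stay constant regardless of horizon length. The matrices and dimensions below are arbitrary illustrations, not VideoSSM's architecture.

```python
import numpy as np

def ssm_step(state, frame_feat, A, B):
    # Constant-cost recurrent update: the fixed-size state summarizes the
    # entire history seen so far.
    return A @ state + B @ frame_feat

d_state, d_feat = 64, 32
A = 0.99 * np.eye(d_state)                    # mild decay keeps the state bounded
B = 0.01 * np.random.randn(d_state, d_feat)
state = np.zeros(d_state)
for _ in range(10_000):                       # long horizons, same memory footprint
    state = ssm_step(state, np.random.randn(d_feat), A, B)
```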
arXiv Detail & Related papers (2025-12-04T07:06:02Z)
- Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft [45.363427511806385]
Memory Forcing is a learning framework that pairs training protocols with a geometry-indexed spatial memory.
We show that Memory Forcing achieves superior long-term spatial consistency and generative quality across diverse environments.
arXiv Detail & Related papers (2025-10-03T17:35:16Z)
- Pack and Force Your Memory: Long-form and Consistent Video Generation [26.53691150499802]
Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding.
MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation.
arXiv Detail & Related papers (2025-10-02T08:22:46Z)
- LONG3R: Long Sequence Streaming 3D Reconstruction [29.79885827038617]
LONG3R is a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences.
Our model achieves real-time processing by operating recurrently, maintaining and updating memory with each new observation.
Experiments demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly for longer sequences.
arXiv Detail & Related papers (2025-07-24T09:55:20Z)
- VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory [55.73900731190389]
We introduce Surfel-Indexed View Memory (VMem), a memory module that remembers past views by indexing them geometrically based on the 3D surface elements (surfels) they have observed.
VMem enables efficient retrieval of the most relevant past views when generating new ones.
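A minimal sketch of surfel-indexed retrieval as the summary describes it: past views are indexed by the (quantized) surface elements they observed, and a new view retrieves the stored views that share the most surfels. Class and method names are illustrative assumptions, not VMem's actual interface.

```python
from collections import defaultdict

class SurfelViewMemory:
    def __init__(self):
        self.surfel_to_views = defaultdict(set)   # surfel id -> views that saw it

    def add_view(self, view_id, surfel_ids):
        for s in surfel_ids:
            self.surfel_to_views[s].add(view_id)

    def retrieve(self, query_surfels, k=4):
        # Past views sharing the most observed surfels with the query region
        # are the most relevant ones to condition generation on.
        votes = defaultdict(int)
        for s in query_surfels:
            for v in self.surfel_to_views.get(s, ()):
                votes[v] += 1
        return sorted(votes, key=votes.get, reverse=True)[:k]
```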
arXiv Detail & Related papers (2025-06-23T17:59:56Z)
- Video World Models with Long-term Spatial Memory [110.530715838396]
We introduce a novel framework to enhance long-term consistency of video world models.
Our framework includes mechanisms to store and retrieve information from the long-term spatial memory.
Our evaluations show improved quality, consistency, and context length compared to relevant baselines.
arXiv Detail & Related papers (2025-06-05T17:42:34Z)
- 3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model [83.70640091897947]
Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences.
In contrast, current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments.
We propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions.
arXiv Detail & Related papers (2025-05-28T17:59:13Z)
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
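For intuition, a toy tri-plane lookup: a 3D point is projected onto three axis-aligned feature planes (XY, XZ, YZ) and the three samples are combined, giving 3D-aware features at O(R^2) rather than O(R^3) storage. The resolution, channel count, and nearest-neighbour sampling below are simplifications, not RAVEN's implementation.

```python
import numpy as np

R, C = 64, 8                                     # plane resolution and channels
planes = {k: np.random.randn(R, R, C) for k in ("xy", "xz", "yz")}

def triplane_feature(p):
    # p in [0, 1]^3; a real model would use bilinear sampling of learned planes
    i, j, k = np.clip((p * R).astype(int), 0, R - 1)
    return planes["xy"][i, j] + planes["xz"][i, k] + planes["yz"][j, k]

feat = triplane_feature(np.array([0.2, 0.5, 0.9]))   # shape (8,)
```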
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
- Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
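A compact sketch of the per-timestamp loop this summary describes, under assumed names (FeatureMemoryBank, step): only the current frame's features are processed, attending to a fixed-capacity bank of historical features before being written back. The softmax read below stands in for the paper's interaction module; the contrastive sequence enhancement strategy is not modeled here.

```python
from collections import deque
import numpy as np

class FeatureMemoryBank:
    def __init__(self, capacity=8):
        self.bank = deque(maxlen=capacity)   # fixed-size multi-frame history

    def step(self, frame_feat):
        if self.bank:
            hist = np.stack(self.bank)           # (n, d) historical features
            sim = hist @ frame_feat              # similarity to current frame
            w = np.exp(sim - sim.max())
            w /= w.sum()
            frame_feat = frame_feat + w @ hist   # residual read from memory
        self.bank.append(frame_feat)
        return frame_feat
```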
arXiv Detail & Related papers (2023-03-14T02:58:27Z)