RELIC: Interactive Video World Model with Long-Horizon Memory
- URL: http://arxiv.org/abs/2512.04040v1
- Date: Wed, 03 Dec 2025 18:29:20 GMT
- Title: RELIC: Interactive Video World Model with Long-Horizon Memory
- Authors: Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, Kalyan Sunkavalli, Feng Liu, Zhengqi Li, Hao Tan
- Abstract summary: A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control. We present RELIC, a unified framework that tackles these three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
- Score: 74.81433479334821
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles these three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, we fine-tune a bidirectional teacher video model to generate sequences beyond its original 5-second training horizon, and transform it into a causal student generator using a new memory-efficient self-forcing paradigm that enables full-context distillation over long-duration teacher rollouts as well as long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated Unreal Engine-rendered dataset, RELIC achieves real-time generation at 16 FPS while demonstrating more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.
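A minimal sketch of how the camera-aware memory described above could be laid out, assuming hypothetical names throughout (MemoryEntry, CameraMemory, write, as_kv_context): each chunk of history is compressed to a small token budget and stored with the absolute camera pose and relative action of that step, then concatenated into a single attention context. The paper's actual token compression, pose/action encoding, and KV-cache layout are not specified here.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MemoryEntry:
    tokens: np.ndarray   # compressed historical latent tokens for one chunk, (k, d)
    pose: np.ndarray     # absolute camera pose, 4x4 world-from-camera
    action: np.ndarray   # relative action taken over this chunk

@dataclass
class CameraMemory:
    entries: list = field(default_factory=list)

    def write(self, latents, pose, action, budget=16):
        # Stand-in compression: keep only `budget` tokens per chunk so the
        # memory grows slowly even over long horizons.
        self.entries.append(MemoryEntry(latents[:budget], pose, action))

    def as_kv_context(self):
        # Concatenate all stored tokens into one attention context.
        return np.concatenate([e.tokens for e in self.entries], axis=0)

mem = CameraMemory()
mem.write(np.random.randn(64, 128), np.eye(4), np.array([0.0, 0.1]))
ctx = mem.as_kv_context()   # shape (16, 128)
```

In the real model the pose and action would presumably be injected as embeddings so that attention over this context can perform the implicit 3D-consistent retrieval the abstract describes; the naive truncation above is only a stand-in for the learned compression.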
Related papers
- Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory [101.2076718776139]
We propose a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments.
We introduce a Pose-free Memory (HPMC) that distills historical latents into a fixed-budget geometric representation.
We also propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic.
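As a rough illustration of the tri-state discretization this summary mentions, a sketch under assumed names and thresholds (tri_state_label and the band width are illustrative, not the paper's actual Uncertainty-aware Action Labeling module):

```python
import numpy as np

def tri_state_label(motion, threshold=0.05):
    """Map continuous per-axis motion to {-1, 0, +1}; magnitudes inside
    the threshold band are treated as uncertain / no-op."""
    labels = np.zeros(motion.shape, dtype=int)
    labels[motion > threshold] = 1
    labels[motion < -threshold] = -1
    return labels

tri_state_label(np.array([0.2, -0.01, -0.3]))   # -> [1, 0, -1]
```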
arXiv Detail & Related papers (2026-02-02T17:52:56Z)
- Spatia: Video Generation with Updatable Spatial Memory [60.21619361473996]
Spatia is a spatial memory-aware video generation framework that preserves a 3D scene point cloud as persistent spatial memory.
Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM.
Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
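The generate/update loop described above might look like the following sketch; generate_clip and slam_update are placeholder callables standing in for Spatia's generator and visual-SLAM module, whose real interfaces the summary does not give.

```python
def explore(generate_clip, slam_update, memory_points, actions):
    """Generate clips conditioned on a persistent point-cloud memory and
    extend that memory with visual SLAM after every clip."""
    clips = []
    for action in actions:
        clip = generate_clip(memory_points, action)        # condition on memory
        memory_points = slam_update(memory_points, clip)   # register new frames
        clips.append(clip)
    return clips, memory_points
```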
arXiv Detail & Related papers (2025-12-17T18:59:59Z)
- WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling [34.486078065308995]
WorldPlay is a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency.
We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs.
We also propose Context Forcing, a novel distillation method designed for memory-aware models.
arXiv Detail & Related papers (2025-12-16T17:22:46Z)
- VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory [42.2374676860638]
Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally.
Maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition.
We propose VideoSSM, a long-video model that unifies AR diffusion with a hybrid state-space memory.
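A toy linear state-space update shows why such a hybrid memory suits minute-scale horizons: the recurrent state is a fixed-size summary of the whole history, so per-frame cost and memory stay constant regardless of horizon length. The matrices and dimensions below are arbitrary illustrations, not VideoSSM's architecture.

```python
import numpy as np

def ssm_step(state, frame_feat, A, B):
    # Constant-cost recurrent update: the fixed-size state summarizes the
    # entire history seen so far.
    return A @ state + B @ frame_feat

d_state, d_feat = 64, 32
A = 0.99 * np.eye(d_state)                    # mild decay keeps the state bounded
B = 0.01 * np.random.randn(d_state, d_feat)
state = np.zeros(d_state)
for _ in range(10_000):                       # long horizons, same memory footprint
    state = ssm_step(state, np.random.randn(d_feat), A, B)
```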
arXiv Detail & Related papers (2025-12-04T07:06:02Z)
- Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft [45.363427511806385]
Memory Forcing is a learning framework that pairs training protocols with a geometry-indexed spatial memory.
We show that Memory Forcing achieves superior long-term spatial consistency and generative quality across diverse environments.
arXiv Detail & Related papers (2025-10-03T17:35:16Z)
- Pack and Force Your Memory: Long-form and Consistent Video Generation [26.53691150499802]
Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding.
MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation.
arXiv Detail & Related papers (2025-10-02T08:22:46Z)
- LONG3R: Long Sequence Streaming 3D Reconstruction [29.79885827038617]
LONG3R is a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences.
Our model achieves real-time processing by operating recurrently, maintaining and updating memory with each new observation.
Experiments demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly for longer sequences.
arXiv Detail & Related papers (2025-07-24T09:55:20Z)
- VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory [55.73900731190389]
We introduce Surfel-Indexed View Memory (VMem), a memory module that remembers past views by indexing them geometrically based on the 3D surface elements (surfels) they have observed.
VMem enables efficient retrieval of the most relevant past views when generating new ones.
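A minimal sketch of surfel-indexed retrieval as the summary describes it: past views are indexed by the (quantized) surface elements they observed, and a new view retrieves the stored views that share the most surfels. Class and method names are illustrative assumptions, not VMem's actual interface.

```python
from collections import defaultdict

class SurfelViewMemory:
    def __init__(self):
        self.surfel_to_views = defaultdict(set)   # surfel id -> views that saw it

    def add_view(self, view_id, surfel_ids):
        for s in surfel_ids:
            self.surfel_to_views[s].add(view_id)

    def retrieve(self, query_surfels, k=4):
        # Past views sharing the most observed surfels with the query region
        # are the most relevant ones to condition generation on.
        votes = defaultdict(int)
        for s in query_surfels:
            for v in self.surfel_to_views.get(s, ()):
                votes[v] += 1
        return sorted(votes, key=votes.get, reverse=True)[:k]
```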
arXiv Detail & Related papers (2025-06-23T17:59:56Z)
- Video World Models with Long-term Spatial Memory [110.530715838396]
We introduce a novel framework to enhance long-term consistency of video world models.
Our framework includes mechanisms to store and retrieve information from the long-term spatial memory.
Our evaluations show improved quality, consistency, and context length compared to relevant baselines.
arXiv Detail & Related papers (2025-06-05T17:42:34Z)
- 3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model [83.70640091897947]
Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences.
In contrast, current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments.
We propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions.
arXiv Detail & Related papers (2025-05-28T17:59:13Z)
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
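For intuition, a toy tri-plane lookup: a 3D point is projected onto three axis-aligned feature planes (XY, XZ, YZ) and the three samples are combined, giving 3D-aware features at O(R^2) rather than O(R^3) storage. The resolution, channel count, and nearest-neighbour sampling below are simplifications, not RAVEN's implementation.

```python
import numpy as np

R, C = 64, 8                                     # plane resolution and channels
planes = {k: np.random.randn(R, R, C) for k in ("xy", "xz", "yz")}

def triplane_feature(p):
    # p in [0, 1]^3; a real model would use bilinear sampling of learned planes
    i, j, k = np.clip((p * R).astype(int), 0, R - 1)
    return planes["xy"][i, j] + planes["xz"][i, k] + planes["yz"][j, k]

feat = triplane_feature(np.array([0.2, 0.5, 0.9]))   # shape (8,)
```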
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
- Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
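A compact sketch of the per-timestamp loop this summary describes, under assumed names (FeatureMemoryBank, step): only the current frame's features are processed, attending to a fixed-capacity bank of historical features before being written back. The softmax read below stands in for the paper's interaction module; the contrastive sequence enhancement strategy is not modeled here.

```python
from collections import deque
import numpy as np

class FeatureMemoryBank:
    def __init__(self, capacity=8):
        self.bank = deque(maxlen=capacity)   # fixed-size multi-frame history

    def step(self, frame_feat):
        if self.bank:
            hist = np.stack(self.bank)           # (n, d) historical features
            sim = hist @ frame_feat              # similarity to current frame
            w = np.exp(sim - sim.max())
            w /= w.sum()
            frame_feat = frame_feat + w @ hist   # residual read from memory
        self.bank.append(frame_feat)
        return frame_feat
```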
arXiv Detail & Related papers (2023-03-14T02:58:27Z)