WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
- URL: http://arxiv.org/abs/2512.14614v1
- Date: Tue, 16 Dec 2025 17:22:46 GMT
- Title: WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
- Authors: Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo
- Abstract summary: WorldPlay is a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency. We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. We also propose Context Forcing, a novel distillation method designed for memory-aware models.
- Score: 34.486078065308995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. Project page and online demo can be found at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.
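The Reconstituted Context Memory is the most mechanism-heavy claim in the abstract, so a minimal sketch may help fix intuition: keep a rolling window of recent frames plus relevance-ranked long-past anchors, then re-index the retained frames onto a compact local timeline (the "temporal reframing"). All class, field, and heuristic names below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a "Reconstituted Context Memory" as described in the
# abstract; the relevance heuristic and all names are assumptions.
from dataclasses import dataclass, field

@dataclass
class Frame:
    latent: object        # frame latent (e.g., a tensor); placeholder type
    step: int             # absolute generation step
    relevance: float      # geometric importance score (assumed heuristic)

@dataclass
class ReconstitutedContextMemory:
    window: int = 8       # recent frames always kept
    anchors: int = 4      # long-past frames kept by relevance
    frames: list = field(default_factory=list)

    def add(self, frame: Frame) -> None:
        self.frames.append(frame)

    def build_context(self) -> list:
        recent = self.frames[-self.window:]
        past = self.frames[:-self.window]
        # Keep the most geometrically relevant long-past frames.
        keep = sorted(past, key=lambda f: f.relevance, reverse=True)[: self.anchors]
        keep.sort(key=lambda f: f.step)
        context = keep + recent
        # Temporal reframing: re-index onto a compact local timeline so
        # long-past anchors are not attenuated by large positional offsets.
        return [(i, f.latent) for i, f in enumerate(context)]
```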
Related papers
- Beyond Pixel Histories: World Models with Persistent 3D State [50.4601060508243]
PERSIST is a new world-model paradigm that simulates the evolution of a latent 3D scene. We show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods.
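As a contrast with pixel-history conditioning, the persistent-state idea reduces to a state-update loop; this toy sketch uses placeholder function names, not the PERSIST interface.

```python
# Schematic of a persistent latent 3D state: actions evolve the state,
# frames are rendered from it. transition/render are assumed model parts.
def simulate(initial_state, actions, transition, render):
    state, frames = initial_state, []
    for action in actions:
        state = transition(state, action)   # evolve the latent 3D scene
        frames.append(render(state))        # decode a view of the state
    return frames
```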
arXiv Detail & Related papers (2026-03-03T19:58:31Z)
- RELIC: Interactive Video World Model with Long-Horizon Memory [74.81433479334821]
A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control. We present RELIC, a unified framework that tackles these three challenges jointly. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
arXiv Detail & Related papers (2025-12-03T18:29:20Z)
- WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling [42.52474988220278]
We propose WorldPack, a video world model with efficient compressed memory. WorldPack significantly improves spatial consistency, fidelity, and quality in long-term generation. Performance is evaluated on LoopNav, a Minecraft-based benchmark.
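A rough sketch of the compressed-memory idea, assuming older frames are packed into fewer tokens by average pooling; the 4x factor and the recent/old boundary are invented for illustration, not WorldPack's actual scheme.

```python
# Keep recent frames at full token resolution; pool older frames so a long
# history fits a fixed budget. Pooling compressor is an assumption.
import numpy as np

def compress(tokens: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool a (num_tokens, dim) array by `factor` along tokens."""
    n = (tokens.shape[0] // factor) * factor
    return tokens[:n].reshape(-1, factor, tokens.shape[1]).mean(axis=1)

def pack_history(frames: list[np.ndarray], recent: int = 4) -> list[np.ndarray]:
    packed = []
    for age, tokens in enumerate(reversed(frames)):
        if age < recent:
            packed.append(tokens)               # keep recent frames intact
        else:
            packed.append(compress(tokens, 4))  # 4x fewer tokens when old
    return list(reversed(packed))
```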
arXiv Detail & Related papers (2025-12-02T07:06:23Z)
- MagicWorld: Interactive Geometry-driven Video World Exploration [30.475628938909242]
We propose MagicWorld, an interactive video world model that integrates 3D geometric priors and historical retrieval. We introduce the Action-Guided 3D Geometry Module (AG3D), which constructs a point cloud from the first frame of each interaction. We further propose a History Cache Retrieval (HCR) mechanism, which retrieves relevant historical frames during generation and injects them as conditioning signals.
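A minimal sketch of what a history-retrieval step like HCR might look like, assuming cached frames are scored by camera-position distance to the current pose (the paper's actual retrieval criterion may differ).

```python
# Retrieve the k cached frames whose camera positions are nearest the
# current pose; these become extra conditioning signals.
import numpy as np

def retrieve(cache: list[dict], pose: np.ndarray, k: int = 2) -> list[dict]:
    """cache entries: {'pose': (3,) position, 'frame': ...}; k nearest."""
    scored = sorted(cache, key=lambda e: np.linalg.norm(e["pose"] - pose))
    return scored[:k]
```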
arXiv Detail & Related papers (2025-11-24T08:41:28Z)
- MotionStream: Real-Time Video Generation with Interactive Motion Controls [60.403597895657505]
We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control; this generates high-quality videos that adhere to the global text prompt and local motion guidance, but does not run inference on the fly. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming.
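The streaming behavior can be pictured as a bounded-context generation loop; this sketch is only an assumption about the shape of such a loop, with `generate_chunk` and the window size as placeholders rather than MotionStream's API.

```python
# Infinite-length streaming with a sliding context window: memory stays
# bounded, so generation can continue indefinitely.
def stream(model, prompt, motion_stream, window=16):
    context = []                      # recent frames kept as conditioning
    for motion in motion_stream:      # user motion controls arrive online
        frame = model.generate_chunk(prompt, motion, context)
        yield frame
        context = (context + [frame])[-window:]  # drop frames past the window
```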
arXiv Detail & Related papers (2025-11-03T06:37:53Z)
- LONG3R: Long Sequence Streaming 3D Reconstruction [29.79885827038617]
LONG3R is a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences. Our model achieves real-time processing by operating recurrently, maintaining and updating memory with each new observation. Experiments demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly for longer sequences.
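The recurrent design reduces to a fold over incoming views; a schematic sketch follows, where update and decode stand in for the model's learned components (assumed names, not LONG3R's interface).

```python
# Streaming reconstruction: each new view updates a persistent memory,
# and per-view geometry is decoded from that memory.
def reconstruct_stream(views, memory, update, decode):
    for view in views:
        memory = update(memory, view)   # fold the observation into memory
        yield decode(memory, view)      # per-view geometry/pose prediction
```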
arXiv Detail & Related papers (2025-07-24T09:55:20Z)
- VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory [55.73900731190389]
We introduce Surfel-Indexed View Memory (VMem), a memory module that remembers past views by indexing them geometrically based on the 3D surface elements (surfels) they have observed. VMem enables efficient retrieval of the most relevant past views when generating new ones.
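The surfel-indexing idea maps naturally onto an inverted index. Here is a small sketch, assuming surfels carry integer IDs and retrieval counts shared observations; the counting score is an assumption, not VMem's exact ranking.

```python
# Inverted index from surfel ID to the views that observed it; retrieval
# ranks past views by how many currently visible surfels they share.
from collections import Counter, defaultdict

class SurfelViewMemory:
    def __init__(self):
        self.index = defaultdict(set)   # surfel_id -> set of view ids

    def add_view(self, view_id: int, observed_surfels: set[int]) -> None:
        for s in observed_surfels:
            self.index[s].add(view_id)

    def retrieve(self, visible_surfels: set[int], k: int = 4) -> list[int]:
        votes = Counter()
        for s in visible_surfels:
            votes.update(self.index[s])   # views that saw this surfel
        return [v for v, _ in votes.most_common(k)]
```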
arXiv Detail & Related papers (2025-06-23T17:59:56Z)
- Video World Models with Long-term Spatial Memory [110.530715838396]
We introduce a novel framework to enhance the long-term consistency of video world models. Our framework includes mechanisms to store and retrieve information from a long-term spatial memory. Our evaluations show improved quality, consistency, and context length compared to relevant baselines.
arXiv Detail & Related papers (2025-06-05T17:42:34Z)
- GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
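A compact PyTorch sketch of decoupled spatial-then-temporal attention: attend within each frame over space, then across frames at each spatial location. The module layout, head count, and shapes are assumptions for illustration, not the paper's exact architecture.

```python
# Decoupled spatio-temporal attention over (batch, time, space, dim) tokens.
import torch
import torch.nn as nn

class DecoupledSpatioTemporalAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, s, d = x.shape
        xs = x.reshape(b * t, s, d)        # attend over space, per frame
        xs, _ = self.spatial(xs, xs, xs)
        xt = xs.reshape(b, t, s, d).transpose(1, 2).reshape(b * s, t, d)
        xt, _ = self.temporal(xt, xt, xt)  # attend over time, per location
        return xt.reshape(b, s, t, d).transpose(1, 2)
```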
arXiv Detail & Related papers (2020-12-15T18:58:21Z)