Related papers: Yume: An Interactive World Generation Model

URL: http://arxiv.org/abs/2507.17744v1
Date: Wed, 23 Jul 2025 17:57:09 GMT
Title: Yume: An Interactive World Generation Model
Authors: Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, Kaipeng Zhang
Abstract summary: Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world. The method creates a dynamic world from an input image and allows exploration of the world using keyboard actions.
Score: 38.818537395166835
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world, which allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of the world using keyboard actions. To achieve this high-fidelity and interactive video world generation, we introduce a well-designed framework, which consists of four main components: camera motion quantization, video generation architecture, advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction using keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced to the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration by synergistic optimization of adversarial distillation and caching mechanisms. We use the high-quality world exploration dataset Sekai to train Yume, and it achieves remarkable results in diverse scenes and applications. All data, codebase, and model weights are available on https://github.com/stdstu12/YUME. Yume will update monthly to achieve its original goal. Project page: https://stdstu12.github.io/YUME-Project/.

Related papers

Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling [51.40150411616207]
We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model scaled to real-world multi-object datasets. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals.
arXiv Detail & Related papers (2026-03-04T19:36:08Z)
MAD: Motion Appearance Decoupling for efficient Driving World Models [94.40548866741791]
We propose an efficient adaptation framework that converts generalist video models into controllable driving world models. The key idea is to decouple motion learning from appearance synthesis. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors.
arXiv Detail & Related papers (2026-01-14T12:52:23Z)
Learning to Generate Object Interactions with Physics-Guided Video Diffusion [28.191514920144456]
We introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Experiments show that KineMask achieves strong improvements over recent models of comparable size.
arXiv Detail & Related papers (2025-10-02T17:56:46Z)
Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model [15.16063778402193]
Matrix-Game 2.0 is an interactive world model that generates long videos on-the-fly via few-step auto-regressive diffusion. It can generate high-quality minute-level videos across diverse scenes at an ultra-fast speed of 25 FPS.
arXiv Detail & Related papers (2025-08-18T15:28:53Z)
Yan: Foundational Interactive Video Generation [25.398980906541524]
Yan is a foundational framework for interactive video generation, covering the entire pipeline from simulation and generation to editing. We design a highly-compressed, low-latency 3D-VAE coupled with a KV-cache-based shift-window denoising inference process. We propose a hybrid model that explicitly disentangles interactive mechanics simulation from visual rendering, enabling multi-granularity video content editing during interaction through text.
arXiv Detail & Related papers (2025-08-12T03:34:21Z)
Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression [23.99292102237088]
We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics. After post-training, this model can be used as a video simulator for evaluating policies and generating synthetic data.
arXiv Detail & Related papers (2025-02-06T18:38:26Z)
InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video generation models can act as both neural renderers and implicit physics simulators, having learned interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z)
Video Creation by Demonstration [59.389591010842636]
We present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process. Empirically, $\delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations.
arXiv Detail & Related papers (2024-12-12T18:41:20Z)
Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics [67.97235923372035]
We present Puppet-Master, an interactive video generative model that can serve as a motion prior for part-level dynamics. At test time, given a single image and a sparse set of motion trajectories, Puppet-Master can synthesize a video depicting realistic part-level motion faithful to the given drag interactions.
arXiv Detail & Related papers (2024-08-08T17:59:38Z)
IRASim: A Fine-Grained World Model for Robot Manipulation [24.591694756757278]
We present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details. We train a diffusion transformer and introduce a novel frame-level action-conditioning module within each transformer block to explicitly model and strengthen the action-frame alignment.
arXiv Detail & Related papers (2024-06-20T17:50:16Z)
Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation [26.292052071093945]
We propose an unsupervised method to generate videos from a single frame and a sparse motion input. Our trained model can generate unseen realistic object-to-object interactions. We show that YODA is on par with or better than state of the art video generation prior work in terms of both controllability and video quality.
arXiv Detail & Related papers (2023-06-06T19:50:02Z)
Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning. We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)
Learning Dynamic View Synthesis With Few RGBD Cameras [60.36357774688289]
We propose to utilize RGBD cameras to synthesize free-viewpoint videos of dynamic indoor scenes. We generate point clouds from RGBD frames and then render them into free-viewpoint videos via a neural feature. We introduce a simple Regional Depth-Inpainting module that adaptively inpaints missing depth values to render complete novel views.
arXiv Detail & Related papers (2022-04-22T03:17:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.