Yume-1.5: A Text-Controlled Interactive World Generation Model
- URL: http://arxiv.org/abs/2512.22096v1
- Date: Fri, 26 Dec 2025 17:52:49 GMT
- Title: Yume-1.5: A Text-Controlled Interactive World Generation Model
- Authors: Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang
- Abstract summary: Yume-1.5 is a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. It achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds.
- Score: 78.93049063633084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges, such as excessively large parameter counts, reliance on lengthy inference steps, and rapidly growing historical context, that severely limit real-time performance; they also lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. We have provided the codebase in the supplementary material.
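The abstract's components suggest a loop like the sketch below: a fixed-size compressed history (standing in for unified context compression with linear attention) plus a few-step conditional sampler (standing in for the distilled real-time model). This is a minimal illustrative sketch, not the released codebase; every name (`compress_context`, `denoise_chunk`, the update rules, the toy sizes) is an assumption.

```python
# Hypothetical sketch of a keyboard-driven streaming generation loop.
# A fixed-size state stands in for "unified context compression"; the
# few-step sampler stands in for the distilled real-time model.
import torch

STATE_DIM, FRAME_SHAPE, DENOISE_STEPS = 512, (3, 64, 64), 4  # toy sizes

def compress_context(state, frame_feat):
    # Linear-attention-style running update: O(1) memory in sequence
    # length, so history never grows with the number of generated frames.
    return 0.95 * state + 0.05 * frame_feat

def denoise_chunk(state, action_emb, steps=DENOISE_STEPS):
    # Placeholder for a distilled few-step sampler conditioned on the
    # compressed history and the current keyboard-action embedding.
    x = torch.randn(*FRAME_SHAPE)
    for _ in range(steps):
        x = x - 0.1 * x + 0.01 * (action_emb.mean() + state.mean())
    return x

actions = {"W": 0, "A": 1, "S": 2, "D": 3}
action_table = torch.randn(len(actions), STATE_DIM)
state = torch.zeros(STATE_DIM)
for key in ["W", "W", "D"]:                    # simulated user keystrokes
    frame = denoise_chunk(state, action_table[actions[key]])
    state = compress_context(state, torch.randn(STATE_DIM))  # encode(frame) in practice
    print(key, tuple(frame.shape))
```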
Related papers
- Any4D: Open-Prompt 4D Generation from Natural Language and Images [7.541641344819342]
We propose Primitive Embodied World Models (PEWM), which restrict video generation to shorter horizons. Our framework bridges the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
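As a rough illustration of the short-horizon idea, the sketch below alternates a high-level planner with a short-clip generator instead of rolling out one long video; both functions are hypothetical stand-ins, not PEWM's actual interface.

```python
# Hypothetical sketch of short-horizon rollout: generate short clips and
# re-plan between them instead of one long video.
def plan_next_goal(history):          # high-level reasoner (stand-in)
    return f"goal-{len(history)}"

def generate_clip(goal, horizon=8):   # short-horizon video model (stand-in)
    return [f"{goal}/frame{t}" for t in range(horizon)]

history = []
for _ in range(3):                    # 3 short clips instead of one 24-frame rollout
    clip = generate_clip(plan_next_goal(history))
    history.extend(clip)
print(len(history), history[-1])
```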
arXiv Detail & Related papers (2025-11-24T04:17:26Z)
- Text-guided Visual Prompt DINO for Generic Segmentation [31.33676182634522]
We propose Prompt-DINO, a text-guided visual Prompt DINO framework. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features. Second, we design order-aligned query selection for DETR-based architectures. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model.
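A minimal sketch of what early fusion could look like in PyTorch, assuming prompt tokens are simply concatenated with flattened backbone features before the encoder; the shapes and the concatenation scheme are illustrative guesses, not Prompt-DINO's actual design.

```python
# Hypothetical sketch of "early fusion": prompt tokens join backbone
# feature tokens before the transformer encoder, rather than being
# injected later at the decoder. All shapes are toy values.
import torch
import torch.nn as nn

D = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=2)

img_feats  = torch.randn(1, 400, D)   # flattened backbone feature map
text_toks  = torch.randn(1, 12, D)    # projected text-prompt embeddings
visual_tok = torch.randn(1, 4, D)     # projected visual-prompt (e.g., box) embeddings

fused = encoder(torch.cat([text_toks, visual_tok, img_feats], dim=1))
prompt_part, feat_part = fused[:, :16], fused[:, 16:]  # split back out
print(prompt_part.shape, feat_part.shape)
```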
arXiv Detail & Related papers (2025-08-08T09:09:30Z)
- PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models [73.4445896872942]
PacTure is a framework for generating physically-based rendering (PBR) material textures from an untextured 3D mesh. We introduce view packing, a novel technique that increases the effective resolution for each view.
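The sketch below illustrates the general idea behind packing views into a shared canvas: crop each rendered view to its foreground so the object gets more pixels per view at a fixed canvas size. The cropping-and-tiling scheme is a simplification, not PacTure's actual packing algorithm.

```python
# Hypothetical sketch of "view packing": crop each view to its foreground
# bounding box and tile the crops into one canvas, so each view's content
# occupies more pixels than naive uniform multi-view tiling would allow.
import numpy as np

def crop_to_foreground(view, mask):
    ys, xs = np.nonzero(mask)
    return view[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

views = [np.random.rand(256, 256, 3) for _ in range(4)]
masks = [np.zeros((256, 256), bool) for _ in range(4)]
for m in masks:
    m[64:192, 64:192] = True          # toy foreground covering 1/4 of each view

crops = [crop_to_foreground(v, m) for v, m in zip(views, masks)]
top    = np.concatenate([crops[0], crops[1]], axis=1)
bottom = np.concatenate([crops[2], crops[3]], axis=1)
canvas = np.concatenate([top, bottom], axis=0)   # 2x2 packed canvas
print(canvas.shape)                   # (256, 256, 3): 4 views in one view's budget
```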
arXiv Detail & Related papers (2025-05-28T14:23:30Z)
- STRICT: Stress Test of Rendering Images Containing Text [14.124910427202273]
STRICT is a benchmark designed to stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities.
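A toy version of what such a stress test implies: ask a model to render a target string, read the result back (a real harness would run OCR on pixels), and score instruction alignment. The generator and reader below are stand-ins, and the edit-ratio metric is an assumed choice, not necessarily STRICT's.

```python
# Hypothetical sketch of a text-rendering stress test: generate, read
# back, and score each target string against what was rendered.
from difflib import SequenceMatcher

def generate_image_with_text(prompt):        # stand-in for a diffusion model
    return {"rendered_text": prompt.upper()} # a real model returns pixels

def read_back(image):                        # stand-in for an OCR pass
    return image["rendered_text"]

targets = ["hello world", "a longer sentence stresses long-range consistency"]
for t in targets:
    out = read_back(generate_image_with_text(t))
    score = SequenceMatcher(None, t.lower(), out.lower()).ratio()
    print(f"{score:.2f}  {t!r}")
```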
arXiv Detail & Related papers (2025-05-25T05:37:08Z)
- RAPID: Efficient Retrieval-Augmented Long Text Generation with Writing Planning and Information Discovery [69.41989381702858]
Existing methods, such as direct generation and multi-agent discussion, often struggle with issues like hallucinations, topic incoherence, and significant latency. We propose RAPID, an efficient retrieval-augmented long text generation framework. Our work provides a robust and efficient solution to the challenges of automated long-text generation.
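A minimal plan-retrieve-write pipeline in the spirit described, with all three stages as hypothetical stand-ins; RAPID's actual planning and retrieval components are not reproduced here.

```python
# Hypothetical sketch: draft an outline, retrieve evidence per outline
# item, then generate each section grounded in its retrieved passages.
def plan_outline(topic):
    return [f"{topic}: background", f"{topic}: methods", f"{topic}: outlook"]

def retrieve(query, k=2):                       # toy in-memory retriever
    corpus = {"background": ["doc-a", "doc-b"], "methods": ["doc-c"], "outlook": ["doc-d"]}
    return [d for key, docs in corpus.items() if key in query for d in docs][:k]

def write_section(heading, evidence):           # stand-in for the generator
    return f"## {heading}\n(grounded in {', '.join(evidence) or 'nothing'})"

article = "\n\n".join(write_section(h, retrieve(h)) for h in plan_outline("diffusion"))
print(article)
```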
arXiv Detail & Related papers (2025-03-02T06:11:29Z)
- Text2Data: Low-Resource Data Generation with Textual Control [100.5970757736845]
Text2Data is a novel approach that utilizes unlabeled data to understand the underlying data distribution. It undergoes finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
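One generic way to phrase such a constrained objective is a task loss plus a soft penalty that keeps the finetuned model near its pretrained state; the sketch below uses an L2-to-initialization penalty purely as an illustrative proxy, not Text2Data's actual constraint.

```python
# Hypothetical sketch of a constraint-regularized finetuning objective:
# task loss on labeled data plus a penalty discouraging drift from the
# pretrained weights (a generic proxy against catastrophic forgetting).
import torch
import torch.nn as nn

model = nn.Linear(8, 8)
anchor = [p.detach().clone() for p in model.parameters()]  # "pretrained" weights
opt, lam = torch.optim.SGD(model.parameters(), lr=1e-2), 0.1

x, y = torch.randn(16, 8), torch.randn(16, 8)
for step in range(5):
    task_loss = nn.functional.mse_loss(model(x), y)
    drift = sum(((p - a) ** 2).sum() for p, a in zip(model.parameters(), anchor))
    loss = task_loss + lam * drift          # soft form of the constraint
    opt.zero_grad(); loss.backward(); opt.step()
    print(step, float(loss))
```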
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
- Event Transition Planning for Open-ended Text Generation [55.729259805477376]
Open-ended text generation tasks require models to generate a coherent continuation given limited preceding context.
We propose a novel two-stage method which explicitly arranges the ensuing events in open-ended text generation.
Our approach can be understood as a specially-trained coarse-to-fine algorithm.
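A toy rendering of the two-stage recipe, with both the coarse event planner and the fine-grained realizer as stand-in functions rather than the paper's trained models:

```python
# Hypothetical sketch: stage 1 predicts a coarse chain of event
# transitions from the context; stage 2 conditions generation on it.
def plan_events(context, n=3):
    seed = context.split()[-1]
    return [f"{seed}->event{i}" for i in range(n)]   # coarse event path

def realize(context, events):
    return context + " " + " ".join(f"[{e}]" for e in events)  # fine realization

ctx = "The knight entered the ruined castle"
print(realize(ctx, plan_events(ctx)))
```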
arXiv Detail & Related papers (2022-04-20T13:37:51Z)
- PLANET: Dynamic Content Planning in Autoregressive Transformers for Long-form Text Generation [47.97523895218194]
We propose a novel generation framework leveraging autoregressive self-attention mechanism to conduct content planning and surface realization dynamically.
Our framework enriches the Transformer decoder with latent representations to maintain sentence-level semantic plans grounded by bag-of-words.
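A minimal sketch of the bag-of-words grounding idea: train a per-sentence latent plan vector to predict which words its sentence contains, via an auxiliary multi-label loss. The sizes and loss choice are illustrative assumptions, not PLANET's exact formulation.

```python
# Hypothetical sketch: each sentence-level latent plan is supervised to
# reconstruct its sentence's bag-of-words through an auxiliary loss.
import torch
import torch.nn as nn

V, D, S = 100, 32, 3                        # toy vocab, hidden, sentence counts
plan_head = nn.Linear(D, V)                 # latent plan -> bag-of-words logits
plans = torch.randn(S, D, requires_grad=True)          # per-sentence latents
bow = torch.zeros(S, V)
bow[torch.arange(S), torch.randint(0, V, (S,))] = 1    # toy word-occurrence targets

# Multi-label BCE: each plan must account for its sentence's word set.
aux_loss = nn.functional.binary_cross_entropy_with_logits(plan_head(plans), bow)
aux_loss.backward()
print(float(aux_loss), plans.grad.shape)
```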
arXiv Detail & Related papers (2022-03-17T05:52:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.