WorldDreamer: Towards General World Models for Video Generation via
Predicting Masked Tokens
- URL: http://arxiv.org/abs/2401.09985v1
- Date: Thu, 18 Jan 2024 14:01:20 GMT
- Title: WorldDreamer: Towards General World Models for Video Generation via
Predicting Masked Tokens
- Authors: Xiaofeng Wang, Zheng Zhu, Guan Huang, Boyuan Wang, Xinze Chen, Jiwen
Lu
- Abstract summary: We introduce WorldDreamer, a pioneering world model designed to build a comprehensive understanding of general world physics and motion.
WorldDreamer frames world modeling as an unsupervised visual sequence modeling challenge.
Our experiments show that WorldDreamer excels in generating videos across different scenarios, including natural scenes and driving environments.
- Score: 75.02160668328425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models play a crucial role in understanding and predicting the dynamics
of the world, which is essential for video generation. However, existing world
models are confined to specific scenarios such as gaming or driving, limiting
their ability to capture the complexity of general world dynamic environments.
Therefore, we introduce WorldDreamer, a pioneering world model that fosters a
comprehensive understanding of general world physics and motion, which
significantly enhances the capabilities of video generation. Drawing
inspiration from the success of large language models, WorldDreamer frames
world modeling as an unsupervised visual sequence modeling challenge. This is
achieved by mapping visual inputs to discrete tokens and predicting the masked
ones. During this process, we incorporate multi-modal prompts to facilitate
interaction within the world model. Our experiments show that WorldDreamer
excels in generating videos across different scenarios, including natural
scenes and driving environments. WorldDreamer showcases versatility in
executing tasks such as text-to-video conversion, image-to-video synthesis, and
video editing. These results underscore WorldDreamer's effectiveness in
capturing dynamic elements within diverse general world environments.
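The abstract's training objective (map visual inputs to discrete tokens, corrupt a subset with a mask token, and predict the originals) can be sketched with a toy corruption step. This is an illustrative sketch only: the codebook size, mask ratio, and grid shape below are assumptions for demonstration, not WorldDreamer's actual tokenizer or configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(tokens, mask_ratio, mask_id, rng):
    """Randomly replace a fraction of tokens with mask_id.

    Returns the corrupted token array, the flat indices that were masked,
    and the original token values at those positions (the prediction targets).
    """
    flat = tokens.ravel().copy()
    n_mask = int(round(mask_ratio * flat.size))
    idx = rng.choice(flat.size, size=n_mask, replace=False)
    targets = flat[idx].copy()
    flat[idx] = mask_id
    return flat.reshape(tokens.shape), idx, targets

# Toy "video": 4 frames, each an 8x8 grid of discrete codebook indices
# (assumed codebook size of 256; a real visual tokenizer would produce these).
video_tokens = rng.integers(0, 256, size=(4, 8, 8))
MASK_ID = 256  # reserved id outside the codebook

masked, idx, targets = mask_tokens(video_tokens, mask_ratio=0.5,
                                   mask_id=MASK_ID, rng=rng)

# A model would now be trained to predict `targets` at positions `idx`
# given the corrupted sequence `masked` (plus any multi-modal prompts).
print(masked.shape, int((masked == MASK_ID).sum()))
```

In this setup the sequence model never sees the masked-out values at training time; the loss is computed only at the masked positions, mirroring the masked-language-modeling recipe the abstract credits to large language models.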
Related papers
- DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model [65.43473733967038]
We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics.
Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge.
arXiv Detail & Related papers (2024-10-14T17:19:23Z)
- Pandora: Towards General World Model with Natural Language Actions and Video States [61.30962762314734]
Pandora is a hybrid autoregressive-diffusion model that simulates world states by generating videos and allows real-time control with free-text actions.
Pandora achieves domain generality, video consistency, and controllability through large-scale pretraining and instruction tuning.
arXiv Detail & Related papers (2024-06-12T18:55:51Z)
- iVideoGPT: Interactive VideoGPTs are Scalable World Models [70.02290687442624]
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making.
This work introduces Interactive VideoGPT, a scalable autoregressive transformer framework that integrates multimodal signals (visual observations, actions, and rewards) into a sequence of tokens.
iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations.
arXiv Detail & Related papers (2024-05-24T05:29:12Z)
- DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving [76.24483706445298]
We introduce DriveDreamer, a world model entirely derived from real-world driving scenarios.
In the initial phase, DriveDreamer acquires a deep understanding of structured traffic constraints, while the subsequent stage equips it with the ability to anticipate future states.
DriveDreamer enables the generation of realistic and reasonable driving policies, opening avenues for interaction and practical applications.
arXiv Detail & Related papers (2023-09-18T13:58:42Z)
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.