GameGen-X: Interactive Open-world Game Video Generation
- URL: http://arxiv.org/abs/2411.00769v3
- Date: Fri, 06 Dec 2024 13:09:43 GMT
- Title: GameGen-X: Interactive Open-world Game Video Generation
- Authors: Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, Hao Chen
- Abstract summary: We introduce GameGen-X, the first diffusion transformer model specifically designed for both generating and interactively controlling open-world game videos.
It simulates an array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events.
It provides interactive controllability, predicting and altering future content based on the current clip, thus allowing for gameplay simulation.
- Abstract: We introduce GameGen-X, the first diffusion transformer model specifically designed for both generating and interactively controlling open-world game videos. The model facilitates high-quality, open-domain generation by simulating an extensive array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. It also provides interactive controllability, predicting and altering future content based on the current clip, thus allowing for gameplay simulation. To realize this vision, we first collected and built an Open-World Video Game Dataset from scratch. It is the first and largest dataset for open-world game video generation and control, comprising over a million diverse gameplay video clips sampled from over 150 games, with informative captions from GPT-4o. GameGen-X undergoes a two-stage training process consisting of foundation model pre-training and instruction tuning. First, the model is pre-trained via text-to-video generation and video continuation, endowing it with the capability for long-sequence, high-quality open-domain game video generation. Then, to achieve interactive controllability, we designed InstructNet to incorporate game-related multi-modal control signal experts. This allows the model to adjust latent representations based on user inputs, unifying character interaction and scene content control for the first time in video generation. During instruction tuning, only InstructNet is updated while the pre-trained foundation model is kept frozen, enabling the integration of interactive controllability without loss of the diversity and quality of the generated video content.
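To make the two-stage setup concrete, here is a minimal sketch, assuming a PyTorch-style framework, of instruction tuning with a frozen foundation model and a trainable control adapter. The module names, shapes, and the `ControlAdapter` design are illustrative assumptions, not the paper's actual InstructNet implementation.

```python
import torch
import torch.nn as nn

class ControlAdapter(nn.Module):
    """Illustrative stand-in for InstructNet: adjusts video latents
    from a multi-modal control embedding (keyboard, text, etc.)."""
    def __init__(self, latent_dim: int, control_dim: int):
        super().__init__()
        self.proj = nn.Linear(control_dim, latent_dim)
        self.gate = nn.Linear(latent_dim, latent_dim)

    def forward(self, latents: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # Gated additive modulation of the frozen backbone's latents.
        delta = self.proj(control).unsqueeze(1)              # (B, 1, D)
        return latents + torch.tanh(self.gate(latents)) * delta

# Stand-in for the pre-trained diffusion-transformer backbone.
foundation = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2,
)
adapter = ControlAdapter(latent_dim=256, control_dim=64)

# Freeze the foundation model; only the adapter is updated.
for p in foundation.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

latents = torch.randn(4, 16, 256)   # (batch, tokens, dim) dummy video latents
control = torch.randn(4, 64)        # dummy control-signal embedding
target = torch.randn(4, 16, 256)    # dummy regression target for the demo

with torch.no_grad():               # backbone runs without gradients
    features = foundation(latents)
loss = nn.functional.mse_loss(adapter(features, control), target)
loss.backward()
optimizer.step()
```

The key point is the optimizer: it only sees the adapter's parameters, so controllability is added without disturbing the pre-trained weights.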
Related papers
- GameFactory: Creating New Games with Generative Interactive Videos
We present GameFactory, a framework focused on exploring scene generalization in game video generation.
We propose a multi-phase training strategy that decouples game style learning from action control, preserving open-domain generalization.
We extend our framework to enable autoregressive action-controllable game video generation, allowing the production of unlimited-length interactive game videos; a minimal sketch of such a loop appears below.
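As a hedged illustration of that autoregressive loop (not GameFactory's actual interface), the sketch below generates each clip conditioned on the previous clip and the current action; `generate_clip` is a hypothetical stand-in for any conditional video generator.

```python
import torch

def generate_clip(prev_clip: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    # Placeholder: in practice, a diffusion model conditioned on the
    # previous clip's latents and an action embedding would go here.
    return prev_clip + 0.1 * action.view(1, -1, 1, 1)

def rollout(first_clip: torch.Tensor, actions: list[torch.Tensor]) -> torch.Tensor:
    clips = [first_clip]
    for action in actions:                       # one new clip per action
        clips.append(generate_clip(clips[-1], action))
    return torch.cat(clips, dim=0)               # concatenate along time

# 8-frame seed clip plus four actions -> a 40-frame video; extending the
# action list extends the video indefinitely.
video = rollout(torch.zeros(8, 3, 64, 64), [torch.randn(3) for _ in range(4)])
print(video.shape)  # torch.Size([40, 3, 64, 64])
```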
arXiv Detail & Related papers (2025-01-14T18:57:21Z) - Interactive Scene Authoring with Specialized Generative Primitives [25.378818867764323]
Specialized Generative Primitives is a generative framework that allows non-expert users to author high-quality 3D scenes.
Each primitive is an efficient generative model that captures the distribution of a single exemplar from the real world.
We showcase interactive sessions where various primitives are extracted from real-world scenes and controlled to create 3D assets and scenes in a few minutes.
arXiv Detail & Related papers (2024-12-20T04:39:50Z) - Video Creation by Demonstration [59.389591010842636]
We present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction.
By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process.
Empirically, $\delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations; a sketch of the conditioning idea follows.
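A minimal sketch of that conditioning idea, assuming flattened frames and a deliberately tiny bottleneck; the `ActionEncoder` and `predictor` here are hypothetical stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ActionEncoder(nn.Module):
    """Bottleneck encoder: keeps motion information while squeezing out
    appearance by encoding frame differences into a tiny latent."""
    def __init__(self, frame_dim: int, latent_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim)
        )

    def forward(self, demo: torch.Tensor) -> torch.Tensor:
        motion = demo[:, 1:] - demo[:, :-1]   # per-frame differences
        return self.net(motion).mean(dim=1)   # (B, latent_dim)

frame_dim = 3 * 32 * 32
encoder = ActionEncoder(frame_dim)
predictor = nn.Linear(frame_dim + 8, frame_dim)  # stand-in generator

demo = torch.randn(2, 16, frame_dim)   # demonstration video, flattened frames
context = torch.randn(2, frame_dim)    # current frame of the target scene

action_latent = encoder(demo)          # "what happens", not "what it looks like"
next_frame = predictor(torch.cat([context, action_latent], dim=-1))
print(next_frame.shape)                # torch.Size([2, 3072])
```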
arXiv Detail & Related papers (2024-12-12T18:41:20Z) - Unbounded: A Generative Infinite Game of Character Life Simulation [68.37260000219479]
We introduce the concept of a generative infinite game, a video game that transcends the traditional boundaries of finite, hard-coded systems by using generative models.
We leverage recent advances in generative AI to create Unbounded: a game of character life simulation that is fully encapsulated in generative models.
arXiv Detail & Related papers (2024-10-24T17:59:31Z) - WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - Enabling Visual Composition and Animation in Unsupervised Video Generation [42.475807996071175]
We call our model CAGE, for visual Composition and Animation for video GEneration.
We conduct a series of experiments to demonstrate the capabilities of CAGE in various settings.
arXiv Detail & Related papers (2024-03-21T12:50:15Z) - Probabilistic Adaptation of Text-to-Video Models [181.84311524681536]
Video Adapter is capable of incorporating the broad knowledge and preserving the high fidelity of a large pretrained video model in a small, task-specific video model.
Video Adapter is able to generate high-quality yet specialized videos on a variety of tasks such as animation, egocentric modeling, and modeling of simulated and real-world robotics data.
arXiv Detail & Related papers (2023-06-02T19:00:17Z) - Multi-Game Decision Transformers [49.257185338595434]
We show that a single transformer-based model can play a suite of up to 46 Atari games simultaneously at close-to-human performance.
We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning.
We find that our Multi-Game Decision Transformer models offer the best scalability and performance; a minimal sketch of the input layout appears below.
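For intuition, here is a minimal sketch, under assumed shapes and module choices, of the decision-transformer input layout this line of work uses: returns, states, and actions interleaved into one causal sequence, with actions predicted from the state tokens. It is illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

B, T, state_dim, n_actions = 2, 10, 128, 18   # 18 = full Atari action set

embed_return = nn.Linear(1, 256)
embed_state = nn.Linear(state_dim, 256)
embed_action = nn.Embedding(n_actions, 256)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2,
)
action_head = nn.Linear(256, n_actions)

returns = torch.randn(B, T, 1)                 # return-to-go per timestep
states = torch.randn(B, T, state_dim)          # pre-encoded observations
actions = torch.randint(0, n_actions, (B, T))  # discrete past actions

# Interleave (return, state, action) per timestep: sequence length 3T.
tokens = torch.stack(
    [embed_return(returns), embed_state(states), embed_action(actions)], dim=2
).reshape(B, 3 * T, 256)

L = 3 * T
causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
hidden = backbone(tokens, mask=causal_mask)
logits = action_head(hidden[:, 1::3])  # predict each action from its state token
print(logits.shape)                    # torch.Size([2, 10, 18])
```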
arXiv Detail & Related papers (2022-05-30T16:55:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.