Genie: Generative Interactive Environments
- URL: http://arxiv.org/abs/2402.15391v1
- Date: Fri, 23 Feb 2024 15:47:26 GMT
- Title: Genie: Generative Interactive Environments
- Authors: Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge
Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris
Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas
Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei
Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Singh, Tim
Rocktäschel
- Abstract summary: We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos.
The model can be prompted to generate a variety of action-controllable virtual worlds described through text, synthetic images, and even sketches.
- Score: 44.65662949794694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Genie, the first generative interactive environment trained in
an unsupervised manner from unlabelled Internet videos. The model can be
prompted to generate an endless variety of action-controllable virtual worlds
described through text, synthetic images, photographs, and even sketches. At
11B parameters, Genie can be considered a foundation world model. It is
composed of a spatiotemporal video tokenizer, an autoregressive dynamics
model, and a simple and scalable latent action model. Genie enables users to
act in the generated environments on a frame-by-frame basis despite training
without any ground-truth action labels or other domain-specific requirements
typically found in the world model literature. Further, the resulting learned
latent action space facilitates training agents to imitate behaviors from
unseen videos, opening the path for training generalist agents of the future.
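The abstract names three components (a spatiotemporal video tokenizer, an autoregressive dynamics model, and a latent action model) and a frame-by-frame interaction loop. Below is a minimal sketch of how these pieces might compose at inference time; the class names, token shapes, stub behaviors, and the 8-way latent action codebook are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a Genie-style inference loop. All names, shapes, and the
# 8-action codebook size are illustrative assumptions, not the paper's
# actual implementation.
import numpy as np

rng = np.random.default_rng(0)

class VideoTokenizer:
    """Stand-in for the spatiotemporal video tokenizer: maps a frame to a
    grid of discrete token ids and back (here: random ids / flat frames)."""
    def __init__(self, vocab_size: int = 1024, grid: int = 16):
        self.vocab_size, self.grid = vocab_size, grid

    def encode(self, frame: np.ndarray) -> np.ndarray:
        return rng.integers(0, self.vocab_size, size=(self.grid, self.grid))

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        return np.full((64, 64, 3), tokens.mean() / self.vocab_size)

class LatentActionModel:
    """Stand-in for the latent action model: a small discrete action
    vocabulary learned without ground-truth action labels."""
    def __init__(self, num_actions: int = 8):
        self.num_actions = num_actions

class DynamicsModel:
    """Stand-in for the autoregressive dynamics model: predicts the next
    frame's tokens from past tokens plus a chosen latent action id."""
    def __init__(self, vocab_size: int = 1024):
        self.vocab_size = vocab_size

    def predict(self, past_tokens: list[np.ndarray], action: int) -> np.ndarray:
        return (past_tokens[-1] + action) % self.vocab_size

# Frame-by-frame interaction: prompt with a single image, then steer the
# rollout with user-chosen latent action ids in [0, num_actions).
tokenizer, lam, dynamics = VideoTokenizer(), LatentActionModel(), DynamicsModel()
history = [tokenizer.encode(np.zeros((64, 64, 3)))]  # prompt frame
for user_action in [3, 3, 1, 7]:
    next_tokens = dynamics.predict(history, user_action)
    history.append(next_tokens)
    frame = tokenizer.decode(next_tokens)             # render for the user
print(f"Generated {len(history) - 1} controllable frames")
```

Because the latent actions are inferred rather than labelled, the same action vocabulary can in principle be applied to unseen videos, which is how the abstract motivates imitation of behaviors without ground-truth actions.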
Related papers
- Dreamitate: Real-World Visuomotor Policy Learning via Video Generation (2024-06-24) [49.03287909942888]
  We propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task.
  We generate an execution of the task conditioned on images of a novel scene and use this synthesized execution directly to control the robot.
- Pandora: Towards General World Model with Natural Language Actions and Video States (2024-06-12) [61.30962762314734]
  Pandora is a hybrid autoregressive-diffusion model that simulates world states by generating videos and allows real-time control with free-text actions.
  Pandora achieves domain generality, video consistency, and controllability through large-scale pretraining and instruction tuning.
- iVideoGPT: Interactive VideoGPTs are Scalable World Models (2024-05-24) [70.02290687442624]
  This work introduces Interactive VideoGPT, a scalable autoregressive transformer framework that integrates multimodal signals.
  iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations.
  The work advances interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications.
- Large-Scale Actionless Video Pre-Training via Discrete Diffusion for Efficient Policy Learning (2024-02-22) [73.69573252516761]
  We introduce a framework that combines generative pre-training on human videos with policy fine-tuning on action-labeled robot videos.
  Our method generates high-fidelity future videos for planning and improves the fine-tuned policies over previous state-of-the-art approaches.
- WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens (2024-01-18) [75.02160668328425]
  We introduce WorldDreamer, a world model aimed at a comprehensive understanding of general world physics and motions.
  WorldDreamer frames world modeling as an unsupervised visual sequence modeling challenge.
  Our experiments show that WorldDreamer excels at generating videos across different scenarios, including natural scenes and driving environments.
- Learning Universal Policies via Text-Guided Video Generation (2023-01-31) [179.6347119101618]
  A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks.
  Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images.
  We investigate whether such tools can be used to construct more general-purpose agents.