Genie: Generative Interactive Environments
- URL: http://arxiv.org/abs/2402.15391v1
- Date: Fri, 23 Feb 2024 15:47:26 GMT
- Title: Genie: Generative Interactive Environments
- Authors: Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge
Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris
Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas
Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei
Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Singh, Tim
Rocktäschel
- Abstract summary: We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos.
The model can be prompted to generate a variety of action-controllable virtual worlds described through text, synthetic images, and even sketches.
- Score: 44.65662949794694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Genie, the first generative interactive environment trained in
an unsupervised manner from unlabelled Internet videos. The model can be
prompted to generate an endless variety of action-controllable virtual worlds
described through text, synthetic images, photographs, and even sketches. At
11B parameters, Genie can be considered a foundation world model. It comprises
a spatiotemporal video tokenizer, an autoregressive dynamics
model, and a simple and scalable latent action model. Genie enables users to
act in the generated environments on a frame-by-frame basis despite training
without any ground-truth action labels or other domain-specific requirements
typically found in the world model literature. Further, the resulting learned
latent action space facilitates training agents to imitate behaviors from
unseen videos, opening the path for training generalist agents of the future.
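The abstract names three components (a spatiotemporal video tokenizer, a latent action model with a small discrete action vocabulary, and an autoregressive dynamics model) and frame-by-frame control through latent actions. The sketch below is only a rough illustration of how such pieces could be wired into an interactive rollout loop; the paper does not specify an API, so every name and signature here is an assumption.

```python
# Illustrative sketch of Genie-style frame-by-frame interaction.
# All interfaces (encode, decode, dynamics, num_latent_actions) are
# hypothetical stand-ins for the paper's three components.

from typing import Callable, List, Sequence


class GenieLikeWorldModel:
    """Wraps a video tokenizer and a dynamics model for interactive rollout."""

    def __init__(self,
                 encode: Callable[[object], List[int]],    # image -> frame tokens
                 decode: Callable[[List[int]], object],    # frame tokens -> image
                 dynamics: Callable[[List[List[int]], int], List[int]],  # (token history, action) -> next frame tokens
                 num_latent_actions: int = 8):
        self.encode = encode
        self.decode = decode
        self.dynamics = dynamics
        # The latent action model yields a small discrete action space learned without labels.
        self.num_latent_actions = num_latent_actions

    def rollout(self, prompt_image: object, latent_actions: Sequence[int]) -> List[object]:
        """Prompt with one image, then step the world once per user-chosen latent action."""
        token_history = [self.encode(prompt_image)]
        for action in latent_actions:
            assert 0 <= action < self.num_latent_actions, "latent actions are discrete indices"
            token_history.append(self.dynamics(token_history, action))
        return [self.decode(tokens) for tokens in token_history]
```

In this reading, the prompt (a photograph, synthetic image, or sketch) is tokenized once, and each user input picks a discrete latent action that conditions the autoregressive prediction of the next frame's tokens.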
Related papers
- DreamGen: Unlocking Generalization in Robot Learning through Video World Models [120.25799361925387]
DreamGen is a pipeline for training robot policies that generalize across behaviors and environments through neural trajectories.
Our work establishes a promising new axis for scaling robot learning well beyond manual data collection.
arXiv Detail & Related papers (2025-05-19T04:55:39Z)
- Exploration-Driven Generative Interactive Environments [53.05314852577144]
We focus on using many virtual environments for inexpensive, automatically collected interaction data.
We propose a training framework that uses only a random agent in virtual environments.
Our agent is fully independent of environment-specific rewards and thus adapts easily to new environments.
arXiv Detail & Related papers (2025-04-03T12:01:41Z)
- AdaWorld: Learning Adaptable World Models with Latent Actions [76.50869178593733]
We propose AdaWorld, an innovative world model learning approach that enables efficient adaptation.
The key idea is to incorporate action information during the pretraining of world models.
We then develop an autoregressive world model that conditions on these latent actions.
arXiv Detail & Related papers (2025-03-24T17:58:15Z)
- Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination [25.62602420895531]
DreMa is a new approach for constructing digital twins using learned explicit representations of the real world and its dynamics.
We show that DreMa can successfully learn novel physical tasks from just a single example per task variation.
arXiv Detail & Related papers (2024-12-19T15:38:15Z)
- Learning Generative Interactive Environments By Trained Agent Exploration [41.94295877935867]
We propose to improve the model by employing reinforcement-learning-based agents for data generation.
This approach produces diverse datasets that enhance the model's ability to adapt and perform well.
Our evaluation, including a replication of the CoinRun case study, shows that GenieRedux-G achieves superior visual fidelity and controllability.
arXiv Detail & Related papers (2024-09-10T12:00:40Z)
- Pandora: Towards General World Model with Natural Language Actions and Video States [61.30962762314734]
Pandora is a hybrid autoregressive-diffusion model that simulates world states by generating videos and allows real-time control with free-text actions.
Pandora achieves domain generality, video consistency, and controllability through large-scale pretraining and instruction tuning.
arXiv Detail & Related papers (2024-06-12T18:55:51Z)
- iVideoGPT: Interactive VideoGPTs are Scalable World Models [70.02290687442624]
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making.
This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals (visual observations, actions, and rewards) into a sequence of tokens; see the sketch after this list.
iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations.
arXiv Detail & Related papers (2024-05-24T05:29:12Z)
- WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens [75.02160668328425]
We introduce WorldDreamer, a pioneering world model that fosters a comprehensive understanding of general world physics and motions.
WorldDreamer frames world modeling as an unsupervised visual sequence modeling challenge.
Our experiments show that WorldDreamer excels in generating videos across different scenarios, including natural scenes and driving environments.
arXiv Detail & Related papers (2024-01-18T14:01:20Z)
- Learning Universal Policies via Text-Guided Video Generation [179.6347119101618]
A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks.
Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images.
We investigate whether such tools can be used to construct more general-purpose agents.
arXiv Detail & Related papers (2023-01-31T21:28:13Z)
- Evaluating Continual Learning Algorithms by Generating 3D Virtual Environments [66.83839051693695]
Continual learning refers to the ability of humans and animals to incrementally learn over time in a given environment.
We propose to leverage recent advances in 3D virtual environments in order to approach the automatic generation of potentially life-long dynamic scenes with photo-realistic appearance.
A novel element of this paper is that scenes are described in a parametric way, thus allowing the user to fully control the visual complexity of the input stream the agent perceives.
arXiv Detail & Related papers (2021-09-16T10:37:21Z)
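As a rough illustration of the token-sequence formulation mentioned in the iVideoGPT entry above, the sketch below interleaves per-step observation, action, and reward tokens into one stream suitable for an autoregressive transformer. The layout and names are assumptions for illustration, not the paper's exact scheme.

```python
# Illustrative interleaving of observations, actions, and rewards into a
# single token stream (assumed layout, not taken from the iVideoGPT paper).

from typing import List


def interleave_trajectory(obs_tokens: List[List[int]],
                          action_tokens: List[int],
                          reward_tokens: List[int]) -> List[int]:
    """Flatten a trajectory into one stream:
    [obs_0 ..., act_0, rew_0, obs_1 ..., act_1, rew_1, ...]."""
    assert len(obs_tokens) == len(action_tokens) == len(reward_tokens)
    sequence: List[int] = []
    for obs, act, rew in zip(obs_tokens, action_tokens, reward_tokens):
        sequence.extend(obs)   # compressed visual-observation tokens for this step
        sequence.append(act)   # discrete action token
        sequence.append(rew)   # discretized reward token
    return sequence


# Example: three steps with four observation tokens each -> 18 tokens total.
seq = interleave_trajectory(
    obs_tokens=[[11, 12, 13, 14], [15, 16, 17, 18], [19, 20, 21, 22]],
    action_tokens=[101, 102, 103],
    reward_tokens=[201, 202, 203],
)
print(len(seq))  # 18
```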