Learning Universal Policies via Text-Guided Video Generation
- URL: http://arxiv.org/abs/2302.00111v3
- Date: Mon, 20 Nov 2023 05:38:13 GMT
- Title: Learning Universal Policies via Text-Guided Video Generation
- Authors: Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, Pieter Abbeel
- Abstract summary: A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks.
Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images.
We investigate whether such tools can be used to construct more general-purpose agents.
- Score: 179.6347119101618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A goal of artificial intelligence is to construct an agent that can solve a
wide variety of tasks. Recent progress in text-guided image synthesis has
yielded models with an impressive ability to generate complex novel images,
exhibiting combinatorial generalization across domains. Motivated by this
success, we investigate whether such tools can be used to construct more
general-purpose agents. Specifically, we cast the sequential decision making
problem as a text-conditioned video generation problem, where, given a
text-encoded specification of a desired goal, a planner synthesizes a set of
future frames depicting its planned actions in the future, after which control
actions are extracted from the generated video. By leveraging text as the
underlying goal specification, we are able to naturally and combinatorially
generalize to novel goals. The proposed policy-as-video formulation can further
represent environments with different state and action spaces in a unified
space of images, which, for example, enables learning and generalization across
a variety of robot manipulation tasks. Finally, by leveraging pretrained
language embeddings and widely available videos from the internet, the approach
enables knowledge transfer through predicting highly realistic video plans for
real robots.
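The pipeline described in the abstract (text goal, then planned frames, then extracted actions) can be illustrated with a toy sketch. This is not the authors' implementation: the "frames" here are grid positions standing in for images, the `GOALS` table stands in for a language embedding, and the names `plan_video`, `extract_actions`, and `execute` are hypothetical. In a real system the planner would be a text-conditioned video generation model and the action extractor would be learned.

```python
from typing import Dict, List, Tuple

Frame = Tuple[int, int]   # toy stand-in for an image: object position on a grid
Action = Tuple[int, int]  # toy control command: a (dx, dy) step

# Hypothetical goal vocabulary (assumption); a real system would embed free-form text.
GOALS: Dict[str, Frame] = {
    "move block to corner": (3, 2),
    "move block home": (0, 0),
}


def plan_video(first_frame: Frame, goal_text: str) -> List[Frame]:
    """Toy 'video planner': return frames leading from first_frame to the goal."""
    target = GOALS[goal_text]
    x, y = first_frame
    frames = [first_frame]
    while (x, y) != target:
        # Advance one unit per frame, along x first and then along y.
        if x != target[0]:
            x += 1 if target[0] > x else -1
        else:
            y += 1 if target[1] > y else -1
        frames.append((x, y))
    return frames


def extract_actions(frames: List[Frame]) -> List[Action]:
    """Toy action extraction: infer the step between each pair of consecutive frames."""
    return [(b[0] - a[0], b[1] - a[1]) for a, b in zip(frames, frames[1:])]


def execute(start: Frame, actions: List[Action]) -> Frame:
    """Apply the extracted actions in order and return the final position."""
    x, y = start
    for dx, dy in actions:
        x, y = x + dx, y + dy
    return (x, y)


frames = plan_video((0, 0), "move block to corner")
actions = extract_actions(frames)
final_state = execute((0, 0), actions)
```

The point of the sketch is the separation of concerns: the planner only ever produces observations, and actions are recovered afterwards from consecutive frames, which is what lets environments with different action spaces share one image-based plan representation.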
Related papers
- Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts [21.249837293326497]
A generalizable reward function is central to reinforcement learning and planning for robots.
This paper transfers video-language models with robust generalization into a language-conditioned reward function.
Our model shows outstanding generalization to new environments and new instructions for robot planning and reinforcement learning.
arXiv Detail & Related papers (2024-07-20T13:22:59Z)
- Dreamitate: Real-World Visuomotor Policy Learning via Video Generation [49.03287909942888]
We propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task.
We generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot.
arXiv Detail & Related papers (2024-06-24T17:59:45Z)
- Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z)
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z)
- Video Language Planning [137.06052217713054]
Video language planning is an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models.
Our algorithm produces detailed multimodal (video and language) specifications that describe how to complete the final task.
It substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots.
arXiv Detail & Related papers (2023-10-16T17:48:45Z)
- Learning to Act from Actionless Videos through Dense Correspondences [87.1243107115642]
We present an approach to construct a video-based robot policy capable of reliably executing diverse tasks across different robots and environments.
Our method leverages images as a task-agnostic representation, encoding both the state and action information, and text as a general representation for specifying robot goals.
We demonstrate the efficacy of our approach in learning policies on table-top manipulation and navigation tasks.
arXiv Detail & Related papers (2023-10-12T17:59:23Z)
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
- Video Generation from Text Employing Latent Path Construction for Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in Machine Learning and Computer Vision fields of study.
In this paper, we tackle the text to video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.