Learning to Act from Actionless Videos through Dense Correspondences
- URL: http://arxiv.org/abs/2310.08576v1
- Date: Thu, 12 Oct 2023 17:59:23 GMT
- Title: Learning to Act from Actionless Videos through Dense Correspondences
- Authors: Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, Joshua B. Tenenbaum
- Abstract summary: We present an approach to construct a video-based robot policy capable of reliably executing diverse tasks across different robots and environments.
Our method leverages images as a task-agnostic representation, encoding both the state and action information, and text as a general representation for specifying robot goals.
We demonstrate the efficacy of our approach in learning policies on table-top manipulation and navigation tasks.
- Score: 87.1243107115642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we present an approach to construct a video-based robot policy
capable of reliably executing diverse tasks across different robots and
environments from a few video demonstrations, without using any action
annotations. Our method leverages images as a task-agnostic representation,
encoding both the state and action information, and text as a general
representation for specifying robot goals. By synthesizing videos that
"hallucinate" the robot executing actions and combining them with dense
correspondences between frames, our approach can infer the closed-form actions
to execute in an environment without the need for any explicit action labels.
This unique capability allows us to train the policy solely based on RGB videos
and deploy learned policies to various robotic tasks. We demonstrate the
efficacy of our approach in learning policies on table-top manipulation and
navigation tasks. Additionally, we contribute an open-source framework for
efficient video modeling, enabling the training of high-fidelity policy models
with four GPUs within a single day.
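The central idea (synthesize a video of the robot acting, then recover the action from dense correspondences between consecutive frames) can be illustrated with a minimal sketch. The snippet below is not the paper's implementation: it assumes OpenCV and NumPy, substitutes Farneback optical flow for the paper's dense-correspondence model, assumes the inter-frame motion is well approximated by a planar rigid transform, and the helper name `infer_planar_action` is hypothetical.

```python
# Minimal sketch, NOT the paper's method: recover a planar action from
# dense correspondences between two consecutive (synthesized) frames.
# Assumes single-channel uint8 frames of equal size.
import cv2
import numpy as np


def infer_planar_action(frame_t: np.ndarray, frame_t1: np.ndarray):
    """Estimate a planar (dx, dy, dtheta) action from dense correspondences."""
    # Dense correspondences via Farneback optical flow (stand-in for a
    # learned dense-correspondence model).
    flow = cv2.calcOpticalFlowFarneback(
        frame_t, frame_t1, None, 0.5, 3, 15, 3, 5, 1.2, 0
    )

    # Build matched point sets (pixel -> pixel + flow), subsampled for speed.
    h, w = frame_t.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    dst = src + flow.reshape(-1, 2)
    keep = np.arange(0, src.shape[0], 16)
    src, dst = src[keep], dst[keep]

    # Fit a rotation + translation transform to the correspondences (RANSAC
    # by default) and read off the implied action; a real system would map
    # this to the robot's action space (e.g., end-effector deltas).
    matrix, _ = cv2.estimateAffinePartial2D(src, dst)
    dx, dy = float(matrix[0, 2]), float(matrix[1, 2])
    dtheta = float(np.arctan2(matrix[1, 0], matrix[0, 0]))
    return dx, dy, dtheta
```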
Related papers
- OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation [35.97702591413093]
We introduce OKAMI, a method that generates a manipulation plan from a single RGB-D video.
OKAMI uses open-world vision models to identify task-relevant objects and retarget the body motions and hand poses separately.
arXiv Detail & Related papers (2024-10-15T17:17:54Z)
- Dreamitate: Real-World Visuomotor Policy Learning via Video Generation [49.03287909942888]
We propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task.
We generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot.
arXiv Detail & Related papers (2024-06-24T17:59:45Z)
- Vision-based Manipulation from Single Human Video with Open-World Object Graphs [58.23098483464538]
We present an object-centric approach to empower robots to learn vision-based manipulation skills from human videos.
We introduce ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB-D video.
arXiv Detail & Related papers (2024-05-30T17:56:54Z)
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts tracks of how points in an image should move in future time steps based on a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
arXiv Detail & Related papers (2024-05-02T17:56:55Z)
- Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills.
We train our policy to generate appropriate actions given the current scene observation and a video of the target task.
We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z)
- Learning Universal Policies via Text-Guided Video Generation [179.6347119101618]
A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks.
Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images.
We investigate whether such tools can be used to construct more general-purpose agents.
arXiv Detail & Related papers (2023-01-31T21:28:13Z)