Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online
Videos
- URL: http://arxiv.org/abs/2206.11795v1
- Date: Thu, 23 Jun 2022 16:01:11 GMT
- Authors: Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang,
Adrien Ecoffet, Brandon Houghton, Raul Sampedro, Jeff Clune
- Abstract summary: We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning.
We show that this behavioral prior has nontrivial zero-shot capabilities and that it can be fine-tuned with both imitation learning and reinforcement learning.
For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools.
- Score: 16.858980871368175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretraining on noisy, internet-scale datasets has been heavily studied as a
technique for training models with broad, general capabilities for text,
images, and other modalities. However, for many sequential decision domains
such as robotics, video games, and computer use, publicly available data does
not contain the labels required to train behavioral priors in the same way. We
extend the internet-scale pretraining paradigm to sequential decision domains
through semi-supervised imitation learning wherein agents learn to act by
watching online unlabeled videos. Specifically, we show that with a small
amount of labeled data we can train an inverse dynamics model accurate enough
to label a huge unlabeled source of online data -- here, online videos of
people playing Minecraft -- from which we can then train a general behavioral
prior. Despite using the native human interface (mouse and keyboard at 20Hz),
we show that this behavioral prior has nontrivial zero-shot capabilities and
that it can be fine-tuned, with both imitation learning and reinforcement
learning, to hard-exploration tasks that are impossible to learn from scratch
via reinforcement learning. For many tasks our models exhibit human-level
performance, and we are the first to report computer agents that can craft
diamond tools, which can take proficient humans upwards of 20 minutes (24,000
environment actions) of gameplay to accomplish.
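The pipeline described in the abstract (fit an inverse dynamics model on a small labeled set, use it to label a large unlabeled set, then train a behavioral prior on the result) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: observations are 1-D integer positions rather than video frames, the "models" are simple vote tables rather than neural networks, and the names `train_idm`, `pseudo_label`, and `behavioral_cloning` are hypothetical.

```python
from collections import Counter, defaultdict


def train_idm(labeled):
    """Fit a toy inverse dynamics model from (obs, next_obs, action) triples.

    The 'model' is a table mapping the observed change to its most common action.
    """
    votes = defaultdict(Counter)
    for obs, next_obs, action in labeled:
        votes[next_obs - obs][action] += 1
    table = {delta: c.most_common(1)[0][0] for delta, c in votes.items()}
    # Unseen changes fall back to "noop" (an arbitrary choice for this sketch).
    return lambda obs, next_obs: table.get(next_obs - obs, "noop")


def pseudo_label(idm, videos):
    """Attach IDM-predicted actions to unlabeled observation sequences."""
    return [(o, idm(o, n)) for video in videos for o, n in zip(video, video[1:])]


def behavioral_cloning(pairs):
    """Fit a toy policy: the most frequent pseudo-labeled action per observation."""
    votes = defaultdict(Counter)
    for obs, action in pairs:
        votes[obs][action] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in votes.items()}


# Small labeled set (hypothetical data): observation, next observation, action.
labeled = [(0, 1, "right"), (3, 2, "left"), (5, 5, "noop"), (2, 3, "right")]
# Larger unlabeled set (hypothetical "videos"): observation sequences only.
videos = [[0, 1, 2, 3], [0, 1, 2, 3, 3], [1, 2, 3]]

idm = train_idm(labeled)
policy = behavioral_cloning(pseudo_label(idm, videos))

# Episode-length arithmetic from the abstract: 20 minutes at 20 Hz.
actions_per_episode = 20 * 60 * 20
```

The point of the sketch is the data flow: only `train_idm` ever sees action labels, while the policy is trained entirely on actions inferred for the unlabeled sequences. The final line reproduces the abstract's figure of 24,000 environment actions for 20 minutes of play at 20Hz.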
Related papers
- Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA).
LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.
We propose a method to learn from internet-scale videos that do not have robot action labels.
arXiv Detail & Related papers (2024-10-15T16:28:09Z)
- Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets.
We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos.
Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z)
- Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
- Playful Interactions for Representation Learning [82.59215739257104]
We propose to use playful interactions in a self-supervised manner to learn visual representations for downstream tasks.
We collect 2 hours of playful data in 19 diverse environments and use self-predictive learning to extract visual representations.
Our representations generalize better than standard behavior cloning and can achieve similar performance with only half the number of required demonstrations.
arXiv Detail & Related papers (2021-07-19T17:54:48Z)
- Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills [93.12417203541948]
We propose the objective of learning a functional understanding of the environment by learning to reach any goal state in a given dataset.
We find that our method can operate on high-dimensional camera images and learn a variety of skills on real robots that generalize to previously unseen scenes and objects.
arXiv Detail & Related papers (2021-04-15T20:10:11Z)
- Learning Object Manipulation Skills via Approximate State Estimation from Real Videos [47.958512470724926]
Humans are adept at learning new tasks by watching a few instructional videos.
On the other hand, robots that learn new actions either require a lot of effort through trial and error, or use expert demonstrations that are challenging to obtain.
In this paper, we explore a method that facilitates learning object manipulation skills directly from videos.
arXiv Detail & Related papers (2020-11-13T08:53:47Z)
- Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)
- Learning to Play by Imitating Humans [8.209859328381269]
We show that it is possible to acquire a diverse set of skills by self-supervising control on top of human teleoperated play data.
By training a behavioral cloning policy on a relatively small quantity of human play, we autonomously generate a large quantity of cloned play data.
We demonstrate that a general purpose goal-conditioned policy trained on this augmented dataset substantially outperforms one trained only with the original human data.
arXiv Detail & Related papers (2020-06-11T23:28:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.