Unleashing Large-Scale Video Generative Pre-training for Visual Robot
Manipulation
- URL: http://arxiv.org/abs/2312.13139v2
- Date: Thu, 21 Dec 2023 05:34:23 GMT
- Title: Unleashing Large-Scale Video Generative Pre-training for Visual Robot
Manipulation
- Authors: Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu,
Xinghang Li, Minghuan Liu, Hang Li, Tao Kong
- Abstract summary: We introduce GR-1, a GPT-style model designed for multi-task language-conditioned visual robot manipulation.
GR-1 takes as inputs a language instruction, a sequence of observation images, and a sequence of robot states.
It predicts robot actions as well as future images in an end-to-end manner.
- Score: 25.09113607683987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative pre-trained models have demonstrated remarkable effectiveness in
language and vision domains by learning useful representations. In this paper,
we extend the scope of this effectiveness by showing that visual robot
manipulation can significantly benefit from large-scale video generative
pre-training. We introduce GR-1, a straightforward GPT-style model designed for
multi-task language-conditioned visual robot manipulation. GR-1 takes as inputs
a language instruction, a sequence of observation images, and a sequence of
robot states. It predicts robot actions as well as future images in an
end-to-end manner. Thanks to a flexible design, GR-1 can be seamlessly
finetuned on robot data after being pre-trained on a large-scale video dataset. We
perform extensive experiments on the challenging CALVIN benchmark and a real
robot. On the CALVIN benchmark, our method outperforms state-of-the-art
baseline methods and improves the success rate from 88.9% to 94.9%. In the
zero-shot unseen-scene generalization setting, GR-1 improves the success rate
from 53.3% to 85.4%. In real-robot experiments, GR-1 also outperforms baseline
methods and shows strong potential for generalization to unseen scenes and
objects. We provide initial evidence that a unified GPT-style transformer,
augmented with large-scale video generative pre-training, exhibits remarkable
generalization to multi-task visual robot manipulation. Project page:
https://GR1-Manipulation.github.io
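To make the interface concrete, below is a minimal sketch of a GPT-style policy in the spirit of GR-1: it consumes a pooled language embedding, a window of per-frame image features, and robot states, and emits both an action and tokens for a predicted future frame. The module layout, dimensions, and use of pre-extracted features are illustrative assumptions rather than the authors' implementation; during video pre-training only the future-frame head can be supervised, with the action head presumably trained during robot fine-tuning.

```python
# Minimal sketch of a GPT-style manipulation policy in the spirit of GR-1.
# Dimensions, tokenization, and heads are illustrative assumptions, not the
# authors' implementation (e.g., causal masking is omitted for brevity).
import torch
import torch.nn as nn


class GPTStylePolicy(nn.Module):
    def __init__(self, d_model=512, n_layers=6, n_heads=8,
                 state_dim=7, action_dim=7, img_feat_dim=768, txt_feat_dim=512,
                 n_img_tokens=196):
        super().__init__()
        # Project pre-extracted language / image / state features into a shared space.
        self.txt_proj = nn.Linear(txt_feat_dim, d_model)
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.state_proj = nn.Linear(state_dim, d_model)
        # Learnable query tokens for the two prediction targets.
        self.action_query = nn.Parameter(torch.zeros(1, 1, d_model))
        self.future_query = nn.Parameter(torch.zeros(1, n_img_tokens, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, action_dim)     # arm + gripper action
        self.future_head = nn.Linear(d_model, img_feat_dim)   # future-frame tokens

    def forward(self, txt_feat, img_feats, states):
        # txt_feat:  (B, txt_feat_dim)      pooled language embedding
        # img_feats: (B, T, img_feat_dim)   one feature per observation frame
        # states:    (B, T, state_dim)      robot proprioceptive states
        B = txt_feat.shape[0]
        tokens = torch.cat([
            self.txt_proj(txt_feat).unsqueeze(1),
            self.img_proj(img_feats),
            self.state_proj(states),
            self.action_query.expand(B, -1, -1),
            self.future_query.expand(B, -1, -1),
        ], dim=1)
        h = self.transformer(tokens)
        n_future = self.future_query.shape[1]
        action = self.action_head(h[:, -n_future - 1])       # action-query output
        future_tokens = self.future_head(h[:, -n_future:])   # future-frame prediction
        return action, future_tokens


policy = GPTStylePolicy()
act, future = policy(torch.randn(2, 512), torch.randn(2, 4, 768), torch.randn(2, 4, 7))
print(act.shape, future.shape)  # torch.Size([2, 7]) torch.Size([2, 196, 768])
```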
Related papers
- Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets [24.77850617214567]
We propose a foundation representation-learning framework that captures both visual features and the dynamics information of manipulation tasks, such as actions and proprioceptive states.
Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions.
We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, combined with a behavior-cloning (BC)-like actor loss that predicts actions during pre-training and a time-contrastive loss.
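Taken literally, that description pairs an alignment term with an action-regression term. The snippet below is a hypothetical sketch of such a combination (an InfoNCE-style contrastive loss plus an MSE behavior-cloning loss); the encoders, temperature, weighting, and the omitted time-contrastive term are assumptions rather than the paper's actual formulation.

```python
# Hypothetical sketch of a combined contrastive + BC-style pre-training loss.
# Encoders, temperature, and loss weights are illustrative assumptions; the
# time-contrastive term mentioned in the summary is omitted for brevity.
import torch
import torch.nn.functional as F


def combined_pretraining_loss(vis_emb, dyn_emb, pred_actions, gt_actions,
                              temperature=0.07, bc_weight=1.0):
    # vis_emb:      (B, D) visual-observation embeddings
    # dyn_emb:      (B, D) embeddings of proprioceptive state-action dynamics
    # pred_actions: (B, A) actions predicted by the actor head
    # gt_actions:   (B, A) ground-truth actions from the robot dataset
    vis = F.normalize(vis_emb, dim=-1)
    dyn = F.normalize(dyn_emb, dim=-1)
    logits = vis @ dyn.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(vis.shape[0], device=vis.device)
    # InfoNCE: each observation should match its own dynamics embedding.
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))
    # Behavior-cloning-style actor loss on predicted actions.
    bc = F.mse_loss(pred_actions, gt_actions)
    return contrastive + bc_weight * bc
```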
arXiv Detail & Related papers (2024-10-29T17:58:13Z)
- Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA).
LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.
We propose a method to learn from internet-scale videos that do not have robot action labels.
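One common way to obtain action-like labels from unlabeled video, consistent with the summary above, is to infer a discrete latent "action" that explains the transition between consecutive frames. The sketch below is a hypothetical vector-quantized version of that idea and is not the LAPA implementation; the frame features, codebook size, and the missing commitment/codebook losses are all assumptions.

```python
# Hypothetical latent-action model: quantize the change between consecutive
# video frames into a discrete code that can stand in for an action label.
import torch
import torch.nn as nn


class LatentActionQuantizer(nn.Module):
    def __init__(self, feat_dim=768, code_dim=64, n_codes=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.codebook = nn.Embedding(n_codes, code_dim)
        # Decoder predicts the next-frame feature from (current frame, latent action).
        self.decoder = nn.Sequential(nn.Linear(feat_dim + code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, feat_dim))

    def forward(self, frame_t, frame_t1):
        # frame_t, frame_t1: (B, feat_dim) features of consecutive frames
        z = self.encoder(torch.cat([frame_t, frame_t1], dim=-1))
        # Nearest codebook entry = the discrete "latent action" for this transition.
        dists = torch.cdist(z, self.codebook.weight)
        codes = dists.argmin(dim=-1)
        z_q = self.codebook(codes)
        z_q = z + (z_q - z).detach()              # straight-through estimator
        recon = self.decoder(torch.cat([frame_t, z_q], dim=-1))
        loss = nn.functional.mse_loss(recon, frame_t1)
        return codes, loss
```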
arXiv Detail & Related papers (2024-10-15T16:28:09Z)
- GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation [21.455124378231957]
GR-2 is a state-of-the-art generalist robot agent for versatile and generalizable manipulation.
GR-2 is first pre-trained on a vast number of Internet videos to capture the dynamics of the world.
GR-2 exhibits impressive multi-task learning capabilities, achieving an average success rate of 97.7% across more than 100 tasks.
arXiv Detail & Related papers (2024-10-08T16:00:47Z)
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments demonstrate that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z)
- Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans [58.27029676638521]
We show how passive human videos can serve as a rich source of data for learning generalist manipulation robots.
We learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations.
We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects.
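As a rough illustration of such a plan predictor, the hypothetical module below regresses a short horizon of future hand and object configurations from features of the current and goal images; the feature extractor, horizon length, and 6-DoF parameterization are assumptions, not the paper's design.

```python
# Hypothetical plan predictor: given features of the current and goal images,
# regress a short horizon of future hand and object configurations.
import torch
import torch.nn as nn


class HumanPlanPredictor(nn.Module):
    def __init__(self, img_feat_dim=512, horizon=8, hand_dim=6, obj_dim=6):
        super().__init__()
        self.horizon, self.hand_dim, self.obj_dim = horizon, hand_dim, obj_dim
        self.net = nn.Sequential(
            nn.Linear(2 * img_feat_dim, 512), nn.ReLU(),
            nn.Linear(512, horizon * (hand_dim + obj_dim)),
        )

    def forward(self, current_feat, goal_feat):
        # current_feat, goal_feat: (B, img_feat_dim) image features
        out = self.net(torch.cat([current_feat, goal_feat], dim=-1))
        out = out.view(-1, self.horizon, self.hand_dim + self.obj_dim)
        hand_traj = out[..., :self.hand_dim]   # future hand configurations
        obj_traj = out[..., self.hand_dim:]    # future object configurations
        return hand_traj, obj_traj
```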
arXiv Detail & Related papers (2023-12-01T18:54:12Z)
- Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods [14.780597545674157]
We investigate the effects of visual pre-training strategies on robot manipulation tasks from three fundamental perspectives.
We propose a visual pre-training scheme for robot manipulation termed Vi-PRoM, which combines self-supervised learning and supervised learning.
arXiv Detail & Related papers (2023-08-07T14:24:52Z)
- Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations [66.47064743686953]
Eye-in-hand cameras have shown promise in enabling greater sample efficiency and generalization in vision-based robotic manipulation.
Videos of humans performing tasks, on the other hand, are much cheaper to collect since they eliminate the need for expertise in robotic teleoperation.
In this work, we augment narrow robotic imitation datasets with broad unlabeled human video demonstrations to greatly enhance the generalization of eye-in-hand visuomotor policies.
arXiv Detail & Related papers (2023-07-12T07:04:53Z)
- Robot Learning with Sensorimotor Pre-training [98.7755895548928]
We present a self-supervised sensorimotor pre-training approach for robotics.
Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens.
We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
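A plausible reading of "self-supervised sensorimotor pre-training" is masked prediction over interleaved sensorimotor tokens. The sketch below shows that idea in a minimal form; the masking ratio, token embedding scheme, and reconstruction target are assumptions and not necessarily RPT's actual recipe.

```python
# Hypothetical masked sensorimotor pre-training sketch: interleave camera,
# proprioception, and action tokens, mask a random subset, and reconstruct them.
import torch
import torch.nn as nn


class SensorimotorMaskedModel(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, d_model)  # reconstruct the original token

    def forward(self, tokens, mask_ratio=0.25):
        # tokens: (B, L, d_model) interleaved image/proprio/action embeddings
        B, L, _ = tokens.shape
        mask = torch.rand(B, L, device=tokens.device) < mask_ratio
        corrupted = torch.where(mask.unsqueeze(-1),
                                self.mask_token.expand(B, L, -1), tokens)
        recon = self.head(self.transformer(corrupted))
        # Reconstruction loss only on the masked positions.
        loss = ((recon - tokens) ** 2)[mask].mean()
        return loss
```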
arXiv Detail & Related papers (2023-06-16T17:58:10Z)
- Scaling Robot Learning with Semantically Imagined Experience [21.361979238427722]
Recent advances in robot learning have shown promise in enabling robots to perform manipulation tasks.
One of the key contributing factors to this progress is the scale of robot data used to train the models.
We propose an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing.
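As a rough analogue of that idea, an off-the-shelf text-guided inpainting model can "imagine" new objects or backgrounds into existing robot frames. The snippet below uses the Hugging Face diffusers inpainting pipeline as a stand-in; the checkpoint, file paths, and prompt are placeholders, and the paper's actual editing model may differ.

```python
# Rough analogue of semantic data augmentation: text-guided inpainting adds
# imagined objects/backgrounds to an existing robot observation. The pipeline,
# checkpoint, and file paths are illustrative stand-ins, not the paper's setup.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

frame = Image.open("robot_frame.png").convert("RGB").resize((512, 512))
mask = Image.open("table_region_mask.png").convert("L").resize((512, 512))

# The mask marks the region to re-imagine; the prompt supplies the new semantics.
augmented = pipe(
    prompt="a wooden kitchen table cluttered with fruit",
    image=frame,
    mask_image=mask,
).images[0]
augmented.save("augmented_frame.png")
```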
arXiv Detail & Related papers (2023-02-22T18:47:51Z)
- Zero-Shot Robot Manipulation from Passive Human Videos [59.193076151832145]
We develop a framework for extracting agent-agnostic action representations from human videos.
Our framework is based on predicting plausible human hand trajectories.
We deploy the trained model zero-shot for physical robot manipulation tasks.
arXiv Detail & Related papers (2023-02-03T21:39:52Z)