R3M: A Universal Visual Representation for Robot Manipulation
- URL: http://arxiv.org/abs/2203.12601v1
- Date: Wed, 23 Mar 2022 17:55:09 GMT
- Title: R3M: A Universal Visual Representation for Robot Manipulation
- Authors: Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, Abhinav Gupta
- Abstract summary: We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of robotic manipulation tasks.
We find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo.
- Score: 91.55543664116209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study how visual representations pre-trained on diverse human video data
can enable data-efficient learning of downstream robotic manipulation tasks.
Concretely, we pre-train a visual representation using the Ego4D human video
dataset using a combination of time-contrastive learning, video-language
alignment, and an L1 penalty to encourage sparse and compact representations.
The resulting representation, R3M, can be used as a frozen perception module
for downstream policy learning. Across a suite of 12 simulated robot
manipulation tasks, we find that R3M improves task success by over 20% compared
to training from scratch and by over 10% compared to state-of-the-art visual
representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika
Panda arm to learn a range of manipulation tasks in a real, cluttered apartment
given just 20 demonstrations. Code and pre-trained models are available at
https://tinyurl.com/robotr3m.
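The abstract names the three training signals only at a high level, so the sketch below illustrates how a time-contrastive term, a video-language alignment term, and an L1 sparsity penalty could be combined into one objective. The encoder choice, the lang_scorer interface, the InfoNCE formulation, and all loss weights are illustrative assumptions, not the released R3M implementation (see https://tinyurl.com/robotr3m for the official code).

```python
# Illustrative sketch of an R3M-style objective: time-contrastive learning,
# video-language alignment, and an L1 sparsity penalty. All names, weights,
# and loss formulations here are assumptions, not the released R3M code.
import torch
import torch.nn.functional as F
from torch import nn
from torchvision import models

class VisualEncoder(nn.Module):
    """ResNet-50 backbone producing a 2048-d frame embedding (an assumed choice)."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the FC head

    def forward(self, images):                    # images: (B, 3, H, W)
        return self.backbone(images).flatten(1)   # (B, 2048)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE: pull anchor toward positive, push it away from negatives."""
    pos = F.cosine_similarity(anchor, positive, dim=-1) / temperature                # (B,)
    neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / temperature  # (B, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                               # (B, 1+K)
    labels = torch.zeros(len(anchor), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)

def r3m_style_loss(encoder, frames_t0, frames_near, frames_far, frames_other,
                   lang_emb, lang_scorer, w_tcn=1.0, w_lang=1.0, w_l1=1e-5):
    """frames_t0 < frames_near < frames_far are ordered frames from one clip;
    frames_other come from different clips. `lang_scorer` (assumed interface)
    scores how much progress the pair (start, current) makes toward lang_emb."""
    z0 = encoder(frames_t0)
    z_near, z_far = encoder(frames_near), encoder(frames_far)
    z_other = encoder(frames_other)

    # 1) Time-contrastive: a temporally close frame should be nearer to the anchor
    #    than a distant frame from the same clip or frames from other clips.
    negatives = torch.stack([z_far, z_other], dim=1)            # (B, 2, D)
    l_tcn = info_nce(z0, z_near, negatives)

    # 2) Video-language alignment: frames later in the clip should score higher
    #    task progress against the clip's language annotation than earlier frames.
    s_far = lang_scorer(z0, z_far, lang_emb)
    s_near = lang_scorer(z0, z_near, lang_emb)
    l_lang = F.softplus(s_near - s_far).mean()                  # rank later above earlier

    # 3) L1 penalty to encourage sparse, compact embeddings.
    l_sparse = z_near.abs().mean() + z_far.abs().mean()

    return w_tcn * l_tcn + w_lang * l_lang + w_l1 * l_sparse
```

Under this reading, the time-contrastive term gives the embedding temporal structure, the language term ties it to task progress, and the L1 term keeps it sparse and compact so it can serve as a frozen perception module for downstream policy learning.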
Related papers
- Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets [24.77850617214567]
We propose a foundation representation learning framework that captures both visual features and dynamics information, such as the actions and proprioceptive states involved in manipulation tasks.
Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions.
We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, and combine it with a behavior cloning (BC)-like actor loss that predicts actions during pre-training, as well as a time contrastive loss.
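As a rough, hypothetical sketch of the kind of objective this summary describes (assumed module names, shapes, and loss weights; not the paper's implementation), the dynamics-alignment term can be written as a symmetric InfoNCE loss between image embeddings and embeddings of the corresponding proprioceptive state-action pairs, trained alongside a BC-like actor head:

```python
# Hypothetical sketch of a dynamics-aligned contrastive objective with a
# BC-like actor head, in the spirit of the summary above (names/shapes assumed).
import torch
import torch.nn.functional as F
from torch import nn

class DynamicsAlignedPretrainer(nn.Module):
    def __init__(self, img_encoder, state_dim, action_dim, emb_dim=256):
        super().__init__()
        self.img_encoder = img_encoder                       # assumed backbone projecting to emb_dim
        self.dyn_encoder = nn.Sequential(                    # embeds (state, action) pairs
            nn.Linear(state_dim + action_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim))
        self.actor = nn.Sequential(                          # BC-like action head
            nn.Linear(emb_dim + state_dim, 512), nn.ReLU(), nn.Linear(512, action_dim))

    def forward(self, images, states, actions, temperature=0.1):
        z_img = F.normalize(self.img_encoder(images), dim=-1)                          # (B, D)
        z_dyn = F.normalize(self.dyn_encoder(torch.cat([states, actions], -1)), dim=-1)

        # Contrastive alignment: each image should match its own state-action
        # dynamics embedding; other samples in the batch act as negatives.
        logits = z_img @ z_dyn.t() / temperature                                       # (B, B)
        labels = torch.arange(len(images), device=images.device)
        l_align = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))

        # BC-like actor loss: predict the logged action from visual + proprio features.
        pred_action = self.actor(torch.cat([z_img, states], -1))
        l_bc = F.mse_loss(pred_action, actions)

        # Per the summary, a time-contrastive term over frames (as in R3M above)
        # would be added to this sum as well.
        return l_align + l_bc
```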
arXiv Detail & Related papers (2024-10-29T17:58:13Z) - Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction [51.49400490437258]
This work develops a method for imitating articulated object manipulation from a single monocular RGB human demonstration.
We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video.
Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion.
We evaluate 4D-DPM's 3D tracking accuracy on ground truth annotated 3D part trajectories and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot.
arXiv Detail & Related papers (2024-09-26T17:57:16Z) - HRP: Human Affordances for Robotic Pre-Training [15.92416819748365]
We present a framework for pre-training representations on hand, object, and contact affordances.
We experimentally demonstrate (using 3000+ robot trials) that this affordance pre-training scheme boosts performance by a minimum of 15% on 5 real-world tasks.
arXiv Detail & Related papers (2024-07-26T17:59:52Z) - LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments demonstrate that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z) - 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations [19.41216557646392]
3D Diffusion Policy (DP3) is a novel visual imitation learning approach.
In experiments, DP3 handles most tasks with just 10 demonstrations and surpasses baselines with a 24.2% relative improvement.
In real robot experiments, DP3 rarely violates safety requirements, in contrast to baseline methods which frequently do.
arXiv Detail & Related papers (2024-03-06T18:58:49Z) - Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with their environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representations of three state-of-the-art visual encoders for downstream manipulation policy learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z) - RVT: Robotic View Transformer for 3D Object Manipulation [46.25268237442356]
We propose RVT, a multi-view transformer for 3D manipulation that is both scalable and accurate.
A single RVT model works well across 18 RLBench tasks with 249 task variations, achieving 26% higher relative success than the existing state-of-the-art method (PerAct).
arXiv Detail & Related papers (2023-06-26T17:59:31Z) - Robot Learning with Sensorimotor Pre-training [98.7755895548928]
We present a self-supervised sensorimotor pre-training approach for robotics.
Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens.
We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
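The summary leaves the token construction and the self-supervised objective abstract; the sketch below shows one plausible reading in which per-timestep camera features, proprioception, and actions are each linearly embedded into a shared token sequence and a masked-prediction head reconstructs hidden action tokens. All names, shapes, and the masking recipe are assumptions for illustration, not RPT's actual code.

```python
# Hypothetical sensorimotor tokenization + masked-prediction sketch (assumed formulation).
import torch
from torch import nn

class SensorimotorTransformer(nn.Module):
    def __init__(self, img_feat_dim, proprio_dim, action_dim, d_model=256, n_layers=4):
        super().__init__()
        # One linear "tokenizer" per modality maps raw features to d_model tokens.
        self.tok_img = nn.Linear(img_feat_dim, d_model)
        self.tok_proprio = nn.Linear(proprio_dim, d_model)
        self.tok_action = nn.Linear(action_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.head_action = nn.Linear(d_model, action_dim)   # reconstruct masked action tokens

    def forward(self, img_feats, proprio, actions, mask):
        # img_feats/proprio/actions: (B, T, *); mask: (B, 3*T) bool, True = hide this token.
        # Positional embeddings are omitted here for brevity.
        tokens = torch.stack(
            [self.tok_img(img_feats), self.tok_proprio(proprio), self.tok_action(actions)],
            dim=2).flatten(1, 2)                            # (B, 3*T, d_model), interleaved per step
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        h = self.encoder(tokens)
        return self.head_action(h)                          # predictions at every position

# Training would compare predictions at masked action positions against the true actions.
```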
arXiv Detail & Related papers (2023-06-16T17:58:10Z) - A Framework for Efficient Robotic Manipulation [79.10407063260473]
We show that, given only 10 demonstrations, a single robotic arm can learn sparse-reward manipulation policies from pixels.
arXiv Detail & Related papers (2020-12-14T22:18:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.