R3M: A Universal Visual Representation for Robot Manipulation
- URL: http://arxiv.org/abs/2203.12601v1
- Date: Wed, 23 Mar 2022 17:55:09 GMT
- Title: R3M: A Universal Visual Representation for Robot Manipulation
- Authors: Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, Abhinav Gupta
- Abstract summary: We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of robotic manipulation tasks.
We find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo.
- Score: 91.55543664116209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study how visual representations pre-trained on diverse human video data
can enable data-efficient learning of downstream robotic manipulation tasks.
Concretely, we pre-train a visual representation using the Ego4D human video
dataset using a combination of time-contrastive learning, video-language
alignment, and an L1 penalty to encourage sparse and compact representations.
The resulting representation, R3M, can be used as a frozen perception module
for downstream policy learning. Across a suite of 12 simulated robot
manipulation tasks, we find that R3M improves task success by over 20% compared
to training from scratch and by over 10% compared to state-of-the-art visual
representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika
Panda arm to learn a range of manipulation tasks in a real, cluttered apartment
given just 20 demonstrations. Code and pre-trained models are available at
https://tinyurl.com/robotr3m.
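The abstract describes a pre-training objective that combines time-contrastive learning, video-language alignment, and an L1 penalty. The sketch below is NOT the authors' implementation; it is a minimal illustration of how such a combined objective could be assembled, assuming cosine-similarity InfoNCE over (anchor, temporally-near positive, temporally-far negative) frame embeddings and a placeholder `align_score` standing in for an assumed video-language alignment head.

```python
import numpy as np

def l1_penalty(z):
    # L1 penalty encouraging sparse, compact embeddings.
    return np.abs(z).mean()

def time_contrastive_loss(z_anchor, z_pos, z_neg, temperature=0.1):
    # InfoNCE-style loss: frames close in time (anchor, positive) should
    # embed closer together than temporally distant frames (negative).
    def cos_sim(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    s_pos = cos_sim(z_anchor, z_pos) / temperature
    s_neg = cos_sim(z_anchor, z_neg) / temperature
    # -log softmax over {positive, negative}, numerically stabilized.
    m = max(s_pos, s_neg)
    return -(s_pos - m) + np.log(np.exp(s_pos - m) + np.exp(s_neg - m))

def combined_objective(z_anchor, z_pos, z_neg, align_score,
                       lam_align=1.0, lam_l1=0.01):
    # Hypothetical combination: contrastive term, minus a reward for
    # video-language alignment (higher score = better aligned), plus
    # an L1 sparsity term on the embeddings. Weights are illustrative.
    tcn = time_contrastive_loss(z_anchor, z_pos, z_neg)
    sparsity = lam_l1 * (l1_penalty(z_anchor) + l1_penalty(z_pos)
                         + l1_penalty(z_neg))
    return tcn - lam_align * align_score + sparsity
```

Under this sketch, a correctly ordered triple (positive embedded nearer the anchor than the negative) yields a lower contrastive loss than the swapped ordering, which is the property the time-contrastive term is meant to enforce.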
Related papers
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments yield strong performance, demonstrating that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z) - 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations [19.41216557646392]
3D Diffusion Policy (DP3) is a novel visual imitation learning approach.
In experiments, DP3 handles most tasks with just 10 demonstrations and surpasses baselines with a 24.2% relative improvement.
In real robot experiments, DP3 rarely violates safety requirements, in contrast to baseline methods which frequently do.
arXiv Detail & Related papers (2024-03-06T18:58:49Z) - Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z) - Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z) - RVT: Robotic View Transformer for 3D Object Manipulation [46.25268237442356]
We propose RVT, a multi-view transformer for 3D manipulation that is both scalable and accurate.
A single RVT model works well across 18 RLBench tasks with 249 task variations, achieving a 26% higher relative success rate than the existing state-of-the-art method (PerAct).
arXiv Detail & Related papers (2023-06-26T17:59:31Z) - Robot Learning with Sensorimotor Pre-training [98.7755895548928]
We present a self-supervised sensorimotor pre-training approach for robotics.
Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens.
We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
arXiv Detail & Related papers (2023-06-16T17:58:10Z) - Visual Reinforcement Learning with Self-Supervised 3D Representations [15.991546692872841]
We present a unified framework for self-supervised learning of 3D representations for motor control.
Our method enjoys improved sample efficiency in simulated manipulation tasks compared to 2D representation learning methods.
arXiv Detail & Related papers (2022-10-13T17:59:55Z) - A Framework for Efficient Robotic Manipulation [79.10407063260473]
We show that, given only 10 demonstrations, a single robotic arm can learn sparse-reward manipulation policies from pixels.
arXiv Detail & Related papers (2020-12-14T22:18:39Z) - KOVIS: Keypoint-based Visual Servoing with Zero-Shot Sim-to-Real Transfer for Robotics Manipulation [8.81267687440119]
KOVIS is a learning-based, calibration-free visual servoing method for fine robotic manipulation tasks with an eye-in-hand stereo camera system.
We train the deep neural network only in the simulated environment.
We demonstrate the effectiveness of the proposed method in both simulated environments and real-world experiments across different robotic manipulation tasks.
arXiv Detail & Related papers (2020-07-28T02:53:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.