Real-World Robot Learning with Masked Visual Pre-training
- URL: http://arxiv.org/abs/2210.03109v1
- Date: Thu, 6 Oct 2022 17:59:01 GMT
- Title: Real-World Robot Learning with Masked Visual Pre-training
- Authors: Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra
Malik, Trevor Darrell
- Abstract summary: In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks.
Our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module.
We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%).
- Score: 161.88981509645416
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we explore self-supervised visual pre-training on images from
diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our
visual representations are pre-trained via a masked autoencoder (MAE), frozen,
and then passed into a learnable control module. Unlike prior work, we show
that the pre-trained representations are effective across a range of real-world
robotic tasks and embodiments. We find that our encoder consistently
outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and
training from scratch (up to 81%). Finally, we train a 307M parameter vision
transformer on a massive collection of 4.5M images from the Internet and
egocentric videos, and demonstrate clearly the benefits of scaling visual
pre-training for robot learning.
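The setup described above (an MAE-pre-trained ViT that is frozen and feeds a learnable control module) can be pictured with the minimal sketch below. It is an illustration under assumptions, not the authors' code: the encoder is assumed to be a pre-trained ViT handle that returns a pooled feature vector, and the proprioceptive input, MLP sizes, and the behavior-cloning objective mentioned afterward are placeholders not specified in the abstract.

```python
# Minimal sketch (assumptions noted inline) of a frozen visual encoder feeding
# a small learnable control module, as described in the abstract.
import torch
import torch.nn as nn

class FrozenEncoderPolicy(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, proprio_dim: int, action_dim: int):
        super().__init__()
        self.encoder = encoder                # assumed: MAE-pre-trained ViT returning (B, feat_dim)
        for p in self.encoder.parameters():
            p.requires_grad_(False)           # visual representations stay frozen
        # Learnable control module: a small MLP over image features + proprioception.
        # The hidden width of 256 is an illustrative choice, not the paper's value.
        self.control = nn.Sequential(
            nn.Linear(feat_dim + proprio_dim, 256),
            nn.GELU(),
            nn.Linear(256, 256),
            nn.GELU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                 # no gradients flow into the frozen encoder
            feat = self.encoder(image)        # (B, feat_dim) image embedding
        return self.control(torch.cat([feat, proprio], dim=-1))
```

Only self.control receives gradients; training it with, e.g., behavior cloning on demonstrations is an assumption here, since the abstract does not spell out the control-learning objective.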
Related papers
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with their environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods [14.780597545674157]
We investigate the effects of visual pre-training strategies on robot manipulation tasks from three fundamental perspectives: pre-training datasets, models, and training methods.
We propose a visual pre-training scheme for robot manipulation termed Vi-PRoM, which combines self-supervised learning and supervised learning.
arXiv Detail & Related papers (2023-08-07T14:24:52Z)
- Robot Learning with Sensorimotor Pre-training [98.7755895548928]
We present a self-supervised sensorimotor pre-training approach for robotics.
Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens.
We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots (a rough sketch of the sensorimotor-token idea appears after this list).
arXiv Detail & Related papers (2023-06-16T17:58:10Z)
- Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)
- R3M: A Universal Visual Representation for Robot Manipulation [91.55543664116209]
We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of robotic manipulation tasks.
We find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo.
arXiv Detail & Related papers (2022-03-23T17:55:09Z)
- Masked Visual Pre-training for Motor Control [118.18189211080225]
Self-supervised visual pre-training from real-world images is effective for learning motor control tasks from pixels.
We freeze the visual encoder and train neural network controllers on top with reinforcement learning.
This is the first self-supervised model to exploit real-world images at scale for motor control.
arXiv Detail & Related papers (2022-03-11T18:58:10Z)
- Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers [10.452316044889177]
We train a coarse image segmentation model for the Duckietown environment using 70 training images.
Our model performs coarse image segmentation at the 8x8 patch level, and the inference resolution can be adjusted to balance prediction granularity and real-time perception constraints.
The resulting perception model is used as the backbone for a simple yet robust visual servoing agent (see the patch-level segmentation sketch after this list).
arXiv Detail & Related papers (2022-03-07T19:47:52Z)
- KOVIS: Keypoint-based Visual Servoing with Zero-Shot Sim-to-Real Transfer for Robotics Manipulation [8.81267687440119]
KOVIS is a learning-based, calibration-free visual servoing method for fine robotic manipulation tasks with an eye-in-hand stereo camera system.
We train the deep neural network only in the simulated environment.
We demonstrate the effectiveness of the proposed method in both simulated environments and real-world experiments with different robotic manipulation tasks.
arXiv Detail & Related papers (2020-07-28T02:53:28Z)
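The sensorimotor pre-training entry above (RPT) describes a Transformer operating on sequences of sensorimotor tokens. The sketch below illustrates that idea only; the token layout (image feature, proprioception, action per timestep), the masking of action tokens, and all dimensions are assumptions for illustration rather than RPT's published design.

```python
# Hypothetical sketch of masked prediction over interleaved sensorimotor tokens,
# in the spirit of RPT. Token layout, dimensions, and masking are assumptions.
import torch
import torch.nn as nn

class SensorimotorTransformer(nn.Module):
    def __init__(self, img_dim=768, proprio_dim=16, action_dim=8, d_model=256, n_layers=4):
        super().__init__()
        # Project each modality into a shared token space.
        self.embed_img = nn.Linear(img_dim, d_model)
        self.embed_proprio = nn.Linear(proprio_dim, d_model)
        self.embed_action = nn.Linear(action_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.predict_action = nn.Linear(d_model, action_dim)  # decode masked action tokens

    def forward(self, img_feat, proprio, actions, action_mask):
        # img_feat: (B, T, img_dim), proprio: (B, T, proprio_dim), actions: (B, T, action_dim)
        # action_mask: (B, T) boolean, True where the action token is hidden.
        a = self.embed_action(actions)
        a = torch.where(action_mask.unsqueeze(-1), self.mask_token.expand_as(a), a)
        # Interleave (image_t, proprio_t, action_t) tokens for each timestep t.
        tokens = torch.stack(
            [self.embed_img(img_feat), self.embed_proprio(proprio), a], dim=2
        ).flatten(1, 2)                           # (B, 3*T, d_model)
        h = self.backbone(tokens)                 # self-attention over the whole sequence
        return self.predict_action(h[:, 2::3])    # predictions at the action positions
```

Pre-training would then minimize a reconstruction loss (e.g., mean-squared error) at the masked positions; RPT's actual objective, token choices, and positional encodings may differ.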
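The monocular-navigation entry describes coarse segmentation predicted per 8x8 image patch on top of a pre-trained Vision Transformer, with adjustable inference resolution. Below is a minimal sketch of that pattern under assumptions: a single linear classifier over per-patch ViT features; the backbone, feature dimension, and class count are placeholders, not the paper's configuration.

```python
# Hypothetical per-patch segmentation head over ViT patch features, in the
# spirit of the monocular-navigation entry. Dimensions and classes are assumptions.
import torch
import torch.nn as nn

class PatchSegmentationHead(nn.Module):
    def __init__(self, feat_dim=384, num_classes=3):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)   # one prediction per patch

    def forward(self, patch_tokens: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
        # patch_tokens: (B, N, feat_dim) per-patch features from a pre-trained ViT,
        # with N = grid_h * grid_w. Feeding a lower-resolution image yields a coarser
        # grid, trading prediction granularity for faster inference.
        logits = self.classifier(patch_tokens)                # (B, N, num_classes)
        return logits.transpose(1, 2).reshape(-1, logits.shape[-1], grid_h, grid_w)  # (B, C, grid_h, grid_w)
```

The patch-level class map can then drive a simple visual servoing controller, as the entry describes; how the original paper maps predictions to steering commands is not covered by this sketch.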