Masked Visual Pre-training for Motor Control
- URL: http://arxiv.org/abs/2203.06173v1
- Date: Fri, 11 Mar 2022 18:58:10 GMT
- Title: Masked Visual Pre-training for Motor Control
- Authors: Tete Xiao, Ilija Radosavovic, Trevor Darrell, Jitendra Malik
- Abstract summary: Self-supervised visual pre-training from real-world images is effective for learning motor control tasks from pixels.
We freeze the visual encoder and train neural network controllers on top with reinforcement learning.
This is the first self-supervised model to exploit real-world images at scale for motor control.
- Score: 118.18189211080225
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper shows that self-supervised visual pre-training from real-world
images is effective for learning motor control tasks from pixels. We first
train the visual representations by masked modeling of natural images. We then
freeze the visual encoder and train neural network controllers on top with
reinforcement learning. We do not perform any task-specific fine-tuning of the
encoder; the same visual representations are used for all motor control tasks.
To the best of our knowledge, this is the first self-supervised model to
exploit real-world images at scale for motor control. To accelerate progress in
learning from pixels, we contribute a benchmark suite of hand-designed tasks
varying in movements, scenes, and robots. Without relying on labels,
state-estimation, or expert demonstrations, we consistently outperform
supervised encoders by up to 80% absolute success rate, sometimes even matching
the oracle state performance. We also find that in-the-wild images, e.g., from
YouTube or Egocentric videos, lead to better visual representations for various
manipulation tasks than ImageNet images.
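The pipeline above reduces to a frozen visual encoder feeding a small learnable controller. The following is a minimal PyTorch-style sketch under assumed names and shapes (FrozenEncoderPolicy, the stand-in encoder, and the 7-dimensional action space are illustrative, not the paper's code); in the paper the encoder is a Vision Transformer pre-trained with masked image modeling, and the controller is optimized with reinforcement learning, which is omitted here.

```python
# Hedged sketch: freeze a pre-trained visual encoder and learn a small
# controller on top, mirroring the frozen-encoder setup in the abstract.
# The encoder here is a placeholder nn.Module, not the paper's MAE ViT.
import torch
import torch.nn as nn

class FrozenEncoderPolicy(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, act_dim: int):
        super().__init__()
        self.encoder = encoder
        # Freeze the visual encoder: no task-specific fine-tuning.
        for p in self.encoder.parameters():
            p.requires_grad = False
        # Small learnable controller (MLP) on top of the frozen features.
        self.policy = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():          # encoder stays fixed
            feats = self.encoder(pixels)
        return self.policy(feats)      # action logits / means

# Example with a stand-in encoder (the real one would be a MAE-pre-trained ViT).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
model = FrozenEncoderPolicy(encoder, feat_dim=128, act_dim=7)
actions = model(torch.randn(4, 3, 32, 32))   # batch of 4 RGB frames
print(actions.shape)                          # torch.Size([4, 7])
```

Because the encoder parameters have requires_grad=False and its forward pass runs under torch.no_grad(), only the controller receives gradients, mirroring the no-fine-tuning setup described in the abstract.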
Related papers
- ViSaRL: Visual Reinforcement Learning Guided by Human Saliency [6.969098096933547]
We introduce Visual Saliency-Guided Reinforcement Learning (ViSaRL).
Using ViSaRL to learn visual representations significantly improves the success rate, sample efficiency, and generalization of an RL agent.
We show that visual representations learned using ViSaRL are robust to various sources of visual perturbations including perceptual noise and scene variations.
arXiv Detail & Related papers (2024-03-16T14:52:26Z) - What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z) - Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with their environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z) - Multi-View Masked World Models for Visual Robotic Manipulation [132.97980128530017]
We train a multi-view masked autoencoder which reconstructs pixels of randomly masked viewpoints.
We demonstrate the effectiveness of our method in a range of scenarios.
We also show that the multi-view masked autoencoder trained with multiple randomized viewpoints enables training a policy with strong viewpoint randomization.
arXiv Detail & Related papers (2023-02-05T15:37:02Z) - Real-World Robot Learning with Masked Visual Pre-training [161.88981509645416]
In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks.
Our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module (see the masking sketch after this list).
We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%).
arXiv Detail & Related papers (2022-10-06T17:59:01Z) - Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z) - Active Perception and Representation for Robotic Manipulation [0.8315801422499861]
We present a framework that leverages the benefits of active perception to accomplish manipulation tasks.
Our agent uses viewpoint changes to localize objects, to learn state representations in a self-supervised manner, and to perform goal-directed actions.
Compared to vanilla deep Q-learning algorithms, our model is at least four times more sample-efficient.
arXiv Detail & Related papers (2020-03-15T01:43:51Z)
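Several entries above, like the main abstract, rely on MAE-style masked image modeling: most patches of each image (or, in the multi-view variant, entire viewpoints) are hidden, and a decoder must reconstruct them in pixel space. Below is a minimal single-view sketch of the random patch-masking step only; the 16-pixel patches, the 75% mask ratio, and the function name random_patch_mask are common-default assumptions, not code from any of the papers.

```python
# Hedged sketch: random patch masking as used in MAE-style pre-training.
# Patch size, mask ratio, and shapes are illustrative assumptions.
import torch

def random_patch_mask(images: torch.Tensor, patch: int = 16, mask_ratio: float = 0.75):
    """Split images into non-overlapping patches and keep a random subset;
    the hidden patches are what the decoder must reconstruct in pixel space."""
    b, c, h, w = images.shape
    # Build a (B, N, C * patch * patch) sequence of flattened patches.
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)
    n = patches.shape[1]
    n_keep = int(n * (1 - mask_ratio))
    # Per-image random permutation; the first n_keep indices are the visible patches.
    ids = torch.argsort(torch.rand(b, n), dim=1)
    keep = ids[:, :n_keep]
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).repeat(1, 1, patches.shape[-1]))
    return visible, keep  # the encoder sees only `visible`

imgs = torch.randn(2, 3, 224, 224)   # two RGB frames
vis, keep = random_patch_mask(imgs)
print(vis.shape)                     # torch.Size([2, 49, 768]): 49 of 196 patches kept
```

The encoder sees only the visible patches; a lightweight decoder (not shown) reconstructs the masked ones, and after pre-training the decoder is discarded and the encoder is frozen for control, as in the sketch following the abstract above.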