What Makes Pre-Trained Visual Representations Successful for Robust
Manipulation?
- URL: http://arxiv.org/abs/2312.12444v1
- Date: Fri, 3 Nov 2023 18:09:08 GMT
- Title: What Makes Pre-Trained Visual Representations Successful for Robust
Manipulation?
- Authors: Kaylee Burns, Zach Witzel, Jubayer Ibn Hamid, Tianhe Yu, Chelsea Finn,
Karol Hausman
- Abstract summary: We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
- Score: 57.92924256181857
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inspired by the success of transfer learning in computer vision, roboticists
have investigated visual pre-training as a means to improve the learning
efficiency and generalization ability of policies learned from pixels. To that
end, past work has favored large object interaction datasets, such as
first-person videos of humans completing diverse tasks, in pursuit of
manipulation-relevant features. Although this approach improves the efficiency
of policy learning, it remains unclear how reliable these representations are
in the presence of distribution shifts that arise commonly in robotic
applications. Surprisingly, we find that visual representations designed for
manipulation and control tasks do not necessarily generalize under subtle
changes in lighting and scene texture or the introduction of distractor
objects. To understand what properties do lead to robust representations, we
compare the performance of 15 pre-trained vision models under different visual
appearances. We find that emergent segmentation ability is a strong predictor
of out-of-distribution generalization among ViT models. The rank order induced
by this metric is more predictive than metrics that have previously guided
generalization research within computer vision and machine learning, such as
downstream ImageNet accuracy, in-domain accuracy, or shape-bias as evaluated by
cue-conflict performance. We test this finding extensively on a suite of
distribution shifts in ten tasks across two simulated manipulation
environments. On the ALOHA setup, segmentation score predicts real-world
performance after offline training with 50 demonstrations.
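The segmentation metric referenced above can be probed without task-specific training. Below is a minimal sketch of this kind of measurement, assuming access to a ViT's final-block [CLS]-to-patch attention maps and a ground-truth object mask downsampled to the patch grid; the attention-mass thresholding and best-over-heads aggregation are illustrative choices, not necessarily the paper's exact protocol.

```python
# Hedged sketch: an emergent-segmentation score for a ViT, in the spirit of the
# metric described in the abstract. Assumes you can extract the last-layer
# [CLS]->patch attention; thresholding and head aggregation are illustrative.
import torch


def jaccard(pred_mask: torch.Tensor, gt_mask: torch.Tensor) -> float:
    """Intersection-over-union between two boolean masks."""
    pred, gt = pred_mask.bool(), gt_mask.bool()
    inter = (pred & gt).sum().item()
    union = (pred | gt).sum().item()
    return inter / union if union > 0 else 0.0


def segmentation_score(cls_attn: torch.Tensor, gt_mask: torch.Tensor,
                       keep_mass: float = 0.6) -> float:
    """
    cls_attn: (num_heads, H_patches, W_patches) attention from the [CLS] token
              to image patches in the final transformer block.
    gt_mask:  (H_patches, W_patches) boolean ground-truth object mask.
    Each head's attention is binarized by keeping the smallest set of patches
    that holds `keep_mass` of the attention mass (a DINO-style heuristic);
    the score is the best IoU over heads.
    """
    scores = []
    for head in cls_attn:
        flat = head.flatten()
        order = torch.argsort(flat, descending=True)
        cum = torch.cumsum(flat[order], dim=0) / flat.sum()
        k = int((cum < keep_mass).sum().item()) + 1
        pred = torch.zeros_like(flat, dtype=torch.bool)
        pred[order[:k]] = True
        scores.append(jaccard(pred.view_as(head), gt_mask))
    return max(scores)
```

Averaging such a score over a held-out set of frames gives one number per pre-trained model, and the rank order it induces can then be compared against out-of-distribution policy performance, in the spirit of the analysis the abstract describes.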
Related papers
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
Experimental results show that MPI improves over the previous state of the art by 10% to 64% on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- Value Explicit Pretraining for Learning Transferable Representations [11.069853883599102]
We propose a method that learns generalizable representations for transfer reinforcement learning.
We learn new tasks that share objectives with previously learned tasks by training an encoder for objective-conditioned representations.
Experiments on a realistic navigation simulator and the Atari benchmark show that the encoder pretrained by our method outperforms current state-of-the-art pretraining methods.
arXiv Detail & Related papers (2023-12-19T17:12:35Z)
- ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning by jointly optimizing a reinforcement learning policy and an inverse dynamics prediction objective (a minimal sketch of the latter follows this entry).
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
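The inverse dynamics objective mentioned in the ALP entry above is a common self-supervised signal: predict the action that connects two consecutive observations from their embeddings. Below is a minimal sketch assuming a generic image encoder and continuous actions; the module names, dimensions, and loss are illustrative assumptions, not ALP's actual implementation.

```python
# Hedged sketch of an inverse dynamics prediction objective: given embeddings
# of consecutive frames, predict the action taken between them. The encoder,
# dimensions, and loss are illustrative stand-ins, not ALP's actual code.
import torch
import torch.nn as nn


class InverseDynamicsHead(nn.Module):
    def __init__(self, embed_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, z_t: torch.Tensor, z_next: torch.Tensor) -> torch.Tensor:
        # Concatenate the two frame embeddings and regress the action.
        return self.net(torch.cat([z_t, z_next], dim=-1))


def inverse_dynamics_loss(encoder: nn.Module, head: InverseDynamicsHead,
                          obs_t, obs_next, action) -> torch.Tensor:
    """MSE between predicted and executed action; gradients flow into the
    encoder, shaping its representation around controllable scene factors."""
    z_t, z_next = encoder(obs_t), encoder(obs_next)
    return nn.functional.mse_loss(head(z_t, z_next), action)
```

In ALP, per the summary above, a term like this is combined with a reinforcement learning objective rather than used alone.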
- CIPER: Combining Invariant and Equivariant Representations Using Contrastive and Predictive Learning [6.117084972237769]
We introduce Contrastive Invariant and Predictive Equivariant Representation learning (CIPER).
CIPER combines invariant and equivariant learning objectives using one shared encoder with two different output heads on top of it (a minimal sketch of this layout follows this entry).
We evaluate our method on static image tasks and time-augmented image datasets.
arXiv Detail & Related papers (2023-02-05T07:50:46Z)
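Below is a minimal sketch of the shared-encoder, two-head layout described in the CIPER entry above: one head is trained with an invariance (contrastive) objective across augmented views, the other with an equivariance objective that predicts the augmentation parameters relating the views. All module names, shapes, and loss choices are illustrative assumptions, not the paper's exact objectives.

```python
# Hedged sketch of a shared encoder with an invariant head and an equivariant
# head, in the spirit of the CIPER summary above. Losses and shapes are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoHeadModel(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int, aug_dim: int,
                 proj_dim: int = 128):
        super().__init__()
        self.encoder = encoder
        self.invariant_head = nn.Linear(embed_dim, proj_dim)    # contrastive projection
        self.equivariant_head = nn.Linear(embed_dim, aug_dim)   # predicts augmentation params

    def forward(self, x):
        z = self.encoder(x)
        return self.invariant_head(z), self.equivariant_head(z)


def two_head_loss(model, view_a, view_b, aug_params, temperature: float = 0.1):
    """Invariant term: matching views of the same image align (InfoNCE-style).
    Equivariant term: the representation predicts the augmentation parameters
    (regression used here as a simple stand-in)."""
    inv_a, eq_a = model(view_a)
    inv_b, _ = model(view_b)
    inv_a, inv_b = F.normalize(inv_a, dim=-1), F.normalize(inv_b, dim=-1)
    logits = inv_a @ inv_b.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    invariant_loss = F.cross_entropy(logits, targets)
    equivariant_loss = F.mse_loss(eq_a, aug_params)
    return invariant_loss + equivariant_loss
```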
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)
- Learning to See before Learning to Act: Visual Pre-training for Manipulation [48.731528716324355]
We find that pre-training on vision tasks significantly improves generalization and sample efficiency for learning to manipulate objects.
We explore directly transferring model parameters from vision networks to affordance prediction networks and show that this can result in successful zero-shot adaptation (a minimal sketch of this kind of parameter transfer follows this entry).
With just a small amount of robotic experience, we can further fine-tune the affordance model to achieve better results.
arXiv Detail & Related papers (2021-07-01T17:58:37Z)
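Below is a minimal sketch of the parameter transfer described in the entry above: initialize an affordance-prediction network from an ImageNet-pretrained backbone and replace only the head. The backbone choice, head architecture, and output resolution are illustrative assumptions, not the paper's setup, and the `weights` argument assumes torchvision >= 0.13.

```python
# Hedged sketch of transferring pretrained vision-network parameters into an
# affordance-prediction network. Backbone and head are illustrative choices.
import torch.nn as nn
import torchvision


def build_affordance_model(num_affordance_channels: int = 1) -> nn.Module:
    # Start from an ImageNet-pretrained backbone and keep its convolutional trunk.
    backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    trunk = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
    # Replace the classification head with a lightweight per-pixel affordance head.
    head = nn.Sequential(
        nn.Conv2d(512, 128, kernel_size=1), nn.ReLU(),
        nn.Conv2d(128, num_affordance_channels, kernel_size=1),
        nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
    )
    return nn.Sequential(trunk, head)
```

Fine-tuning only the head (or the whole model with a small learning rate) on a modest amount of robot data matches the "small amount of robotic experience" step the summary mentions.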
- Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate our approach on two challenging tasks, non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)