Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation
- URL: http://arxiv.org/abs/2304.06600v1
- Date: Thu, 13 Apr 2023 15:06:28 GMT
- Title: Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation
- Authors: Mohit Sharma, Claudio Fantacci, Yuxiang Zhou, Skanda Koppula, Nicolas
Heess, Jon Scholz, Yusuf Aytar
- Abstract summary: Large models pretrained on common visual learning tasks can provide useful representations for a wide range of specialized perception problems.
We introduce "lossless adaptation" to address the representational drift caused by classical fine-tuning.
We demonstrate that appropriate placement of our parameter-efficient adapters can significantly reduce the performance gap between frozen pretrained representations and full end-to-end fine-tuning.
- Score: 25.47207030637466
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have shown that large models pretrained on common visual
learning tasks can provide useful representations for a wide range of
specialized perception problems, as well as a variety of robotic manipulation
tasks. While prior work on robotic manipulation has predominantly used frozen
pretrained features, we demonstrate that in robotics this approach can fail to
reach optimal performance, and that fine-tuning of the full model can lead to
significantly better results. Unfortunately, fine-tuning disrupts the
pretrained visual representation and causes representational drift towards the
fine-tuned task, leading to a loss of the original model's versatility. We
introduce "lossless adaptation" to address this shortcoming of classical
fine-tuning. We demonstrate that appropriate placement of our
parameter-efficient adapters can significantly reduce the performance gap
between frozen pretrained representations and full end-to-end fine-tuning
without changing the original representation, thus preserving the original
capabilities of the pretrained model. We perform a comprehensive investigation
across three major model architectures (ViTs, NFNets, and ResNets) with both
supervised (ImageNet-1K classification) and self-supervised pretrained weights
(CLIP, BYOL, Visual MAE), across 3 task domains and 35 individual tasks, and
demonstrate that our claims hold strongly across these varied settings.
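As a rough illustration of the parameter-efficient adapter idea described above, the following is a minimal sketch (PyTorch-style Python) of wrapping a frozen pretrained block with a small trainable bottleneck adapter. The adapter design, placement, and zero initialization here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck applied after a frozen block (illustrative)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # zero-init so the adapted model
        nn.init.zeros_(self.up.bias)    # starts identical to the pretrained one

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class AdaptedBlock(nn.Module):
    """Wraps a frozen pretrained block; only the adapter is trainable."""
    def __init__(self, pretrained_block: nn.Module, dim: int):
        super().__init__()
        self.block = pretrained_block
        for p in self.block.parameters():
            p.requires_grad = False     # pretrained weights are never updated
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x):
        return self.adapter(self.block(x))

# Example: optimize only the adapter parameters.
# optimizer = torch.optim.Adam(
#     [p for p in model.parameters() if p.requires_grad], lr=1e-4)
```

With a wrapper like this, only the adapter parameters are passed to the optimizer, so the pretrained weights, and hence the original representation, remain unchanged.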
Related papers
- ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348]
We study the visual generalization capabilities of three existing robotic foundation models.
Our study shows that these models are not robust to visual out-of-domain scenarios.
We propose a gradual backbone reversal approach founded on model merging.
arXiv Detail & Related papers (2024-09-23T17:47:59Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models by augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning; a minimal illustrative sketch of this setup appears after this list.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- Pro-tuning: Unified Prompt Tuning for Vision Tasks [133.12978197265596]
Fine-tuning is the de facto approach for leveraging pre-trained vision models on downstream tasks.
In this work, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks.
arXiv Detail & Related papers (2022-07-28T21:09:31Z)
- Equivariant Descriptor Fields: SE(3)-Equivariant Energy-Based Models for End-to-End Visual Robotic Manipulation Learning [2.8388425545775386]
We present end-to-end SE(3)-equivariant models for visual robotic manipulation from a point cloud input.
We show that our models can learn from scratch without prior knowledge yet are highly sample-efficient.
arXiv Detail & Related papers (2022-06-16T17:26:06Z)
- Reinforcement Learning with Action-Free Pre-Training from Videos [95.25074614579646]
We introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos.
Our framework significantly improves both the final performance and the sample efficiency of vision-based reinforcement learning.
arXiv Detail & Related papers (2022-03-25T19:44:09Z)
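The eP-ALM entry above describes a particularly aggressive form of parameter-efficient adaptation: both the vision encoder and the language model stay frozen, and only a single linear projection plus one trainable prompt token are learned. A minimal sketch of that setup, with illustrative module names and shapes rather than the authors' actual code, might look like:

```python
import torch
import torch.nn as nn

class PerceptualPrefix(nn.Module):
    """One trainable projection and one trainable prompt token; the vision
    encoder and language model themselves are assumed frozen (illustrative)."""
    def __init__(self, vis_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim, lm_dim)                 # only trainable layer
        self.prompt = nn.Parameter(torch.zeros(1, 1, lm_dim))  # one trainable token

    def forward(self, vis_feat: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, vis_dim) pooled feature from a frozen vision encoder
        # text_embeds: (B, T, lm_dim) token embeddings from a frozen language model
        vis_tok = self.proj(vis_feat).unsqueeze(1)             # (B, 1, lm_dim)
        prompt = self.prompt.expand(vis_feat.size(0), -1, -1)  # (B, 1, lm_dim)
        return torch.cat([prompt, vis_tok, text_embeds], dim=1)
```

In this sketch only `proj` and `prompt` would be registered with the optimizer; keeping everything else frozen is what holds the trainable parameter count below 1% of the total.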