Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2209.10901v2
- Date: Tue, 18 Jul 2023 16:12:16 GMT
- Title: Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning
- Authors: Manuel Goulão and Arlindo L. Oliveira
- Abstract summary: We study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess the quality of the learned representations.
Our results show that all methods are effective in learning useful representations and avoiding representational collapse.
The encoder pretrained with the temporal order verification task shows the best results across all experiments.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Vision Transformer architecture has been shown to be
competitive in the computer vision (CV) space, where it has dethroned
convolution-based networks in several benchmarks. Nevertheless, convolutional
neural networks (CNNs) remain the preferred architecture for the
representation module in reinforcement learning. In this work, we study
pretraining a Vision Transformer using several state-of-the-art
self-supervised methods and assess the quality of the learned representations.
To show the importance of the temporal dimension in this context, we propose
an extension of VICReg that better captures temporal relations between
observations by adding a temporal order verification task. Our results show
that all methods are effective in learning useful representations and avoiding
representational collapse for observations from the Arcade Learning
Environment (ALE), which leads to improvements in data efficiency when
evaluated in reinforcement learning (RL). Moreover, the encoder pretrained
with the temporal order verification task shows the best results across all
experiments, with richer representations, more focused attention maps, and
sparser representation vectors throughout the layers of the encoder, which
shows the importance of exploring this similarity dimension. With this work,
we hope to provide some insight into the representations a ViT learns during
self-supervised pretraining on observations from RL environments, and into
which properties of those representations lead to the best-performing agents.
The source code will be available at:
https://github.com/mgoulao/TOV-VICReg
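To make the proposed objective concrete, below is a minimal sketch of how a temporal order verification (TOV) task can be combined with a VICReg-style loss over pairs of consecutive observations. The names (`encoder`, `expander`, `tov_head`) and the TOV coefficient `beta` are illustrative assumptions, not the authors' implementation; see the repository above for the real code.

```python
# Minimal sketch: VICReg terms plus a temporal order verification (TOV) head.
# `encoder`, `expander`, and `tov_head` are assumed callables (e.g. a ViT,
# an MLP expander, and a small binary classifier); they are placeholders.
import torch
import torch.nn.functional as F


def vicreg_terms(z_a, z_b, eps=1e-4):
    """Standard VICReg terms: invariance, variance, covariance."""
    # Invariance: mean-squared error between the two embeddings.
    inv = F.mse_loss(z_a, z_b)

    # Variance: hinge loss keeping each dimension's std above 1.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean()

    # Covariance: penalize off-diagonal covariance entries.
    def cov_penalty(z):
        z = z - z.mean(dim=0)
        n, d = z.shape
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    cov = cov_penalty(z_a) + cov_penalty(z_b)
    return inv, var, cov


def tov_vicreg_loss(encoder, expander, tov_head, obs_t, obs_tp1,
                    lambda_=25.0, mu=25.0, nu=1.0, beta=1.0):
    """VICReg over two consecutive observations plus a binary
    temporal-order-verification objective (coefficients are assumptions)."""
    h_t, h_tp1 = encoder(obs_t), encoder(obs_tp1)
    z_t, z_tp1 = expander(h_t), expander(h_tp1)
    inv, var, cov = vicreg_terms(z_t, z_tp1)

    # TOV: classify whether a pair of representations is in the correct
    # temporal order; reversed pairs serve as negatives.
    ordered = torch.cat([h_t, h_tp1], dim=1)
    shuffled = torch.cat([h_tp1, h_t], dim=1)
    logits = tov_head(torch.cat([ordered, shuffled], dim=0)).squeeze(-1)
    labels = torch.cat([torch.ones(len(ordered)), torch.zeros(len(shuffled))])
    tov = F.binary_cross_entropy_with_logits(logits, labels.to(logits.device))

    return lambda_ * inv + mu * var + nu * cov + beta * tov
```

The TOV term asks a small classifier to tell correctly ordered pairs of consecutive observations from reversed ones, which pushes the encoder to retain temporal information that the invariance term alone would not require.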
Related papers
- Locality Alignment Improves Vision-Language Models
Vision-language models (VLMs) have seen growing adoption in recent years, but many still make basic spatial reasoning errors.
We propose a new efficient post-training stage for ViTs called locality alignment.
We show that locality-aligned backbones improve performance across a range of benchmarks.
arXiv Detail & Related papers (2024-10-14T21:01:01Z)
- Value Explicit Pretraining for Learning Transferable Representations
We propose a method that learns generalizable representations for transfer reinforcement learning.
We learn new tasks that share similar objectives with previously learned tasks by learning an encoder for objective-conditioned representations.
Experiments using a realistic navigation simulator and Atari benchmark show that the pretrained encoder produced by our method outperforms current SoTA pretraining methods.
arXiv Detail & Related papers (2023-12-19T17:12:35Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation?
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- SERE: Exploring Feature Self-relation for Self-supervised Transformer
Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
arXiv Detail & Related papers (2022-06-10T15:25:00Z)
- Self-Promoted Supervision for Few-Shot Transformer
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other few-shot learning frameworks using ViTs, and is the first to achieve higher performance than CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- PreViTS: Contrastive Pretraining with Video Tracking Supervision
PreViTS is a self-supervised learning (SSL) framework that selects clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
- Do Vision Transformers See Like Convolutional Neural Networks?
Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- Representation Learning with Video Deep InfoMax
We extend DeepInfoMax to the video domain by leveraging similar structure in temporal networks.
We find that drawing views from both natural-rate sequences and temporally-downsampled sequences yields strong results on Kinetics-pretrained action recognition tasks.
arXiv Detail & Related papers (2020-07-27T02:28:47Z)