Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2209.10901v2
- Date: Tue, 18 Jul 2023 16:12:16 GMT
- Title: Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning
- Authors: Manuel Goulão and Arlindo L. Oliveira
- Abstract summary: We study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess the quality of the learned representations.
Our results show that all methods are effective in learning useful representations and avoiding representational collapse.
The encoder pretrained with the temporal order verification task shows the best results across all experiments.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Vision Transformer architecture has been shown to be
competitive in the computer vision (CV) space, where it has dethroned
convolution-based networks in several benchmarks. Nevertheless, convolutional
neural networks (CNNs) remain the preferred architecture for the
representation module in reinforcement learning. In this work, we study
pretraining a Vision Transformer using several state-of-the-art
self-supervised methods and assess the quality of the learned representations.
To show the importance of the temporal dimension in this context, we propose
an extension of VICReg that better captures temporal relations between
observations by adding a temporal order verification task. Our results show
that all methods are effective in learning useful representations and avoiding
representational collapse for observations from the Arcade Learning
Environment (ALE), which leads to improvements in data efficiency when
evaluated in reinforcement learning (RL). Moreover, the encoder pretrained
with the temporal order verification task shows the best results across all
experiments, with richer representations, more focused attention maps, and
sparser representation vectors throughout the layers of the encoder, which
shows the importance of exploring this similarity dimension. With this work,
we hope to provide some insight into the representations a ViT learns during
self-supervised pretraining on observations from RL environments, and into
which properties of those representations lead to the best-performing agents.
The source code will be available at:
https://github.com/mgoulao/TOV-VICReg
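To make the proposed objective concrete, below is a minimal sketch of how a temporal order verification (TOV) task can be combined with a VICReg-style loss over pairs of consecutive observations. The names (`encoder`, `expander`, `tov_head`) and the TOV coefficient `beta` are illustrative assumptions, not the authors' implementation; see the repository above for the real code.

```python
# Minimal sketch: VICReg terms plus a temporal order verification (TOV) head.
# `encoder`, `expander`, and `tov_head` are assumed callables (e.g. a ViT,
# an MLP expander, and a small binary classifier); they are placeholders.
import torch
import torch.nn.functional as F


def vicreg_terms(z_a, z_b, eps=1e-4):
    """Standard VICReg terms: invariance, variance, covariance."""
    # Invariance: mean-squared error between the two embeddings.
    inv = F.mse_loss(z_a, z_b)

    # Variance: hinge loss keeping each dimension's std above 1.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean()

    # Covariance: penalize off-diagonal covariance entries.
    def cov_penalty(z):
        z = z - z.mean(dim=0)
        n, d = z.shape
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    cov = cov_penalty(z_a) + cov_penalty(z_b)
    return inv, var, cov


def tov_vicreg_loss(encoder, expander, tov_head, obs_t, obs_tp1,
                    lambda_=25.0, mu=25.0, nu=1.0, beta=1.0):
    """VICReg over two consecutive observations plus a binary
    temporal-order-verification objective (coefficients are assumptions)."""
    h_t, h_tp1 = encoder(obs_t), encoder(obs_tp1)
    z_t, z_tp1 = expander(h_t), expander(h_tp1)
    inv, var, cov = vicreg_terms(z_t, z_tp1)

    # TOV: classify whether a pair of representations is in the correct
    # temporal order; reversed pairs serve as negatives.
    ordered = torch.cat([h_t, h_tp1], dim=1)
    shuffled = torch.cat([h_tp1, h_t], dim=1)
    logits = tov_head(torch.cat([ordered, shuffled], dim=0)).squeeze(-1)
    labels = torch.cat([torch.ones(len(ordered)), torch.zeros(len(shuffled))])
    tov = F.binary_cross_entropy_with_logits(logits, labels.to(logits.device))

    return lambda_ * inv + mu * var + nu * cov + beta * tov
```

The TOV term asks a small classifier to tell correctly ordered pairs of consecutive observations from reversed ones, which pushes the encoder to retain temporal information that the invariance term alone would not require.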
Related papers
- Locality Alignment Improves Vision-Language Models
Vision-language models (VLMs) have seen growing adoption in recent years, but many still make basic spatial reasoning errors.
We propose a new efficient post-training stage for ViTs called locality alignment.
We show that locality-aligned backbones improve performance across a range of benchmarks.
arXiv Detail & Related papers (2024-10-14T21:01:01Z)
- Value Explicit Pretraining for Learning Transferable Representations
We propose a method that learns generalizable representations for transfer reinforcement learning.
We learn new tasks that share similar objectives with previously learned tasks by learning an encoder for objective-conditioned representations.
Experiments using a realistic navigation simulator and Atari benchmark show that the pretrained encoder produced by our method outperforms current SoTA pretraining methods.
arXiv Detail & Related papers (2023-12-19T17:12:35Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation?
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- SERE: Exploring Feature Self-relation for Self-supervised Transformer
Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
arXiv Detail & Related papers (2022-06-10T15:25:00Z)
- Self-Promoted Supervision for Few-Shot Transformer
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other few-shot learning frameworks using ViTs, and is the first to achieve higher performance than CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- PreViTS: Contrastive Pretraining with Video Tracking Supervision
PreViTS is a self-supervised learning (SSL) framework that selects clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
- Do Vision Transformers See Like Convolutional Neural Networks?
Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- Representation Learning with Video Deep InfoMax
We extend DeepInfoMax to the video domain by leveraging similar structure in temporal networks.
We find that drawing views from both natural-rate sequences and temporally-downsampled sequences yields strong results on Kinetics-pretrained action recognition tasks.
arXiv Detail & Related papers (2020-07-27T02:28:47Z)