Evaluating Vision Transformer Methods for Deep Reinforcement Learning
from Pixels
- URL: http://arxiv.org/abs/2204.04905v1
- Date: Mon, 11 Apr 2022 07:10:58 GMT
- Title: Evaluating Vision Transformer Methods for Deep Reinforcement Learning
from Pixels
- Authors: Tianxin Tao, Daniele Reda, Michiel van de Panne
- Abstract summary: We evaluate Vision Transformers (ViT) training methods for image-based reinforcement learning control tasks.
We compare these results to a leading convolutional-network architecture method, RAD.
We find that the CNN architectures trained using RAD still generally provide superior performance.
- Score: 7.426118390008397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViT) have recently demonstrated the significant
potential of transformer architectures for computer vision. To what extent can
image-based deep reinforcement learning also benefit from ViT architectures, as
compared to standard convolutional neural network (CNN) architectures? To
answer this question, we evaluate ViT training methods for image-based
reinforcement learning (RL) control tasks and compare these results to a
leading convolutional-network architecture method, RAD. For training the ViT
encoder, we consider several recently-proposed self-supervised losses that are
treated as auxiliary tasks, as well as a baseline with no additional loss
terms. We find that the CNN architectures trained using RAD still generally
provide superior performance. For the ViT methods, all three types of auxiliary
tasks that we consider provide a benefit over plain ViT training. Furthermore,
ViT masking-based tasks are found to significantly outperform ViT
contrastive-learning.
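The masking-based auxiliary task that the abstract reports outperforming contrastive learning can be illustrated with a minimal MAE-style sketch: hide a random subset of image patches and score a decoder only on the patches it never saw. This is a hypothetical illustration in numpy, not the authors' implementation; `patchify`, `masked_reconstruction_loss`, and the toy decoder are names invented here.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = img.shape
    rows, cols = H // patch, W // patch
    p = img[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch, C)
    return p.transpose(0, 2, 1, 3, 4).reshape(rows * cols, -1)

def masked_reconstruction_loss(img, reconstruct, mask_ratio=0.75, patch=8, seed=0):
    """MSE over the masked patches only, as in MAE-style objectives."""
    rng = np.random.default_rng(seed)
    patches = patchify(img, patch)
    n = patches.shape[0]
    mask = rng.random(n) < mask_ratio      # True = hidden from the encoder
    visible = patches[~mask]
    pred = reconstruct(visible, mask)      # decoder must predict all n patches
    return np.mean((pred[mask] - patches[mask]) ** 2)

# Toy "decoder": predict the mean visible patch everywhere.
img = np.random.default_rng(1).random((64, 64, 3))
loss = masked_reconstruction_loss(
    img, lambda vis, mask: np.tile(vis.mean(axis=0), (mask.shape[0], 1)))
```

In an RL setting such a loss would be added as an auxiliary term alongside the policy or value objective, sharing the ViT encoder.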
Related papers
- T-TAME: Trainable Attention Mechanism for Explaining Convolutional
Networks and Vision Transformers [9.284740716447342]
"Black box" nature of neural networks is a barrier to adoption in applications where explainability is essential.
This paper presents T-TAME, Transformer-compatible Trainable Attention Mechanism for Explanations.
Proposed architecture and training technique can be easily applied to any convolutional or Vision Transformer-like neural network.
arXiv Detail & Related papers (2024-03-07T14:25:03Z)
- Experts Weights Averaging: A New General Training Scheme for Vision
Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into original ViT for inference.
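The expert-averaging conversion summarized above rests on a simple linear-algebra fact: for a purely linear map, averaging the experts' weights is identical to averaging their outputs. A minimal numpy sketch (illustrative only; real FFN experts are two-layer nonlinear MLPs, so the post-training conversion in the paper is an approximation, not this exact identity):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_experts = 16, 32, 4

# Each "expert" here is a single linear map W: R^16 -> R^32.
experts = [rng.standard_normal((d_out, d_in)) for _ in range(n_experts)]
x = rng.standard_normal(d_in)

# Averaging the weights once, ahead of inference...
W_avg = np.mean(experts, axis=0)
y_from_avg_weights = W_avg @ x

# ...gives the same result as averaging the experts' outputs per input.
y_avg_of_outputs = np.mean([W @ x for W in experts], axis=0)
```

Because the averaged weights can be computed once after training, inference runs a single FFN-shaped module with no MoE routing cost.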
arXiv Detail & Related papers (2023-08-11T12:05:12Z)
- RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in
Autonomous Driving [80.14669385741202]
Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and Semantic KITTI.
arXiv Detail & Related papers (2023-01-24T18:50:48Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- When Adversarial Training Meets Vision Transformers: Recipes from
Training to Architecture [32.260596998171835]
Adversarial training is still required for ViTs to defend against adversarial attacks.
We find that pre-training and SGD are necessary for ViTs' adversarial training.
Our code is available at https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers.
arXiv Detail & Related papers (2022-10-14T05:37:20Z)
- Where are my Neighbors? Exploiting Patches Relations in Self-Supervised
Vision Transformer [3.158346511479111]
We propose a simple but effective self-supervised learning (SSL) strategy to train Vision Transformers (ViTs).
We define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly with the downstream training.
Our RelViT model optimizes all the output tokens of the transformer encoder that correspond to image patches, thus exploiting more training signal at each training step.
arXiv Detail & Related papers (2022-06-01T13:25:32Z)
- DeiT III: Revenge of the ViT [56.46810490275699]
A Vision Transformer (ViT) is a simple neural architecture amenable to serve several computer vision tasks.
Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BEiT.
arXiv Detail & Related papers (2022-04-14T17:13:44Z)
- ViTGAN: Training GANs with Vision Transformers [46.769407314698434]
Vision Transformers (ViTs) have shown competitive performance on image recognition while requiring less vision-specific inductive biases.
We introduce several novel regularization techniques for training GANs with ViTs.
Our approach, named ViTGAN, achieves comparable performance to the leading CNN-based GAN models on three datasets.
arXiv Detail & Related papers (2021-07-09T17:59:30Z)
- Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
We show that self-supervised ViTs exhibit properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
arXiv Detail & Related papers (2021-04-29T12:28:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.