Evaluating Vision Transformer Methods for Deep Reinforcement Learning
from Pixels
- URL: http://arxiv.org/abs/2204.04905v1
- Date: Mon, 11 Apr 2022 07:10:58 GMT
- Title: Evaluating Vision Transformer Methods for Deep Reinforcement Learning
from Pixels
- Authors: Tianxin Tao, Daniele Reda, Michiel van de Panne
- Abstract summary: We evaluate Vision Transformers (ViT) training methods for image-based reinforcement learning control tasks.
We compare these results to a leading convolutional-network architecture method, RAD.
We find that the CNN architectures trained using RAD still generally provide superior performance.
- Score: 7.426118390008397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViT) have recently demonstrated the significant
potential of transformer architectures for computer vision. To what extent can
image-based deep reinforcement learning also benefit from ViT architectures, as
compared to standard convolutional neural network (CNN) architectures? To
answer this question, we evaluate ViT training methods for image-based
reinforcement learning (RL) control tasks and compare these results to a
leading convolutional-network architecture method, RAD. For training the ViT
encoder, we consider several recently-proposed self-supervised losses that are
treated as auxiliary tasks, as well as a baseline with no additional loss
terms. We find that the CNN architectures trained using RAD still generally
provide superior performance. For the ViT methods, all three types of auxiliary
tasks that we consider provide a benefit over plain ViT training. Furthermore,
ViT masking-based tasks are found to significantly outperform ViT
contrastive-learning.
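The masking-based auxiliary task that the abstract reports outperforming contrastive learning can be illustrated with a minimal MAE-style sketch: hide a random subset of image patches and score a decoder only on the patches it never saw. This is a hypothetical illustration in numpy, not the authors' implementation; `patchify`, `masked_reconstruction_loss`, and the toy decoder are names invented here.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = img.shape
    rows, cols = H // patch, W // patch
    p = img[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch, C)
    return p.transpose(0, 2, 1, 3, 4).reshape(rows * cols, -1)

def masked_reconstruction_loss(img, reconstruct, mask_ratio=0.75, patch=8, seed=0):
    """MSE over the masked patches only, as in MAE-style objectives."""
    rng = np.random.default_rng(seed)
    patches = patchify(img, patch)
    n = patches.shape[0]
    mask = rng.random(n) < mask_ratio      # True = hidden from the encoder
    visible = patches[~mask]
    pred = reconstruct(visible, mask)      # decoder must predict all n patches
    return np.mean((pred[mask] - patches[mask]) ** 2)

# Toy "decoder": predict the mean visible patch everywhere.
img = np.random.default_rng(1).random((64, 64, 3))
loss = masked_reconstruction_loss(
    img, lambda vis, mask: np.tile(vis.mean(axis=0), (mask.shape[0], 1)))
```

In an RL setting such a loss would be added as an auxiliary term alongside the policy or value objective, sharing the ViT encoder.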
Related papers
- T-TAME: Trainable Attention Mechanism for Explaining Convolutional
Networks and Vision Transformers [9.284740716447342]
"Black box" nature of neural networks is a barrier to adoption in applications where explainability is essential.
This paper presents T-TAME, Transformer-compatible Trainable Attention Mechanism for Explanations.
Proposed architecture and training technique can be easily applied to any convolutional or Vision Transformer-like neural network.
arXiv Detail & Related papers (2024-03-07T14:25:03Z)
- Experts Weights Averaging: A New General Training Scheme for Vision
Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into original ViT for inference.
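The expert-averaging conversion summarized above rests on a simple linear-algebra fact: for a purely linear map, averaging the experts' weights is identical to averaging their outputs. A minimal numpy sketch (illustrative only; real FFN experts are two-layer nonlinear MLPs, so the post-training conversion in the paper is an approximation, not this exact identity):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_experts = 16, 32, 4

# Each "expert" here is a single linear map W: R^16 -> R^32.
experts = [rng.standard_normal((d_out, d_in)) for _ in range(n_experts)]
x = rng.standard_normal(d_in)

# Averaging the weights once, ahead of inference...
W_avg = np.mean(experts, axis=0)
y_from_avg_weights = W_avg @ x

# ...gives the same result as averaging the experts' outputs per input.
y_avg_of_outputs = np.mean([W @ x for W in experts], axis=0)
```

Because the averaged weights can be computed once after training, inference runs a single FFN-shaped module with no MoE routing cost.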
arXiv Detail & Related papers (2023-08-11T12:05:12Z)
- RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in
Autonomous Driving [80.14669385741202]
Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and Semantic KITTI.
arXiv Detail & Related papers (2023-01-24T18:50:48Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- When Adversarial Training Meets Vision Transformers: Recipes from
Training to Architecture [32.260596998171835]
Adversarial training is still required for ViTs to defend against adversarial attacks.
We find that pre-training and SGD are necessary for ViTs' adversarial training.
Our code is available at https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers.
arXiv Detail & Related papers (2022-10-14T05:37:20Z)
- Where are my Neighbors? Exploiting Patches Relations in Self-Supervised
Vision Transformer [3.158346511479111]
We propose a simple but effective self-supervised learning (SSL) strategy to train Vision Transformers (ViTs).
We define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly with the downstream training.
Our RelViT model optimizes all the output tokens of the transformer encoder that correspond to image patches, thus exploiting more training signal at each training step.
arXiv Detail & Related papers (2022-06-01T13:25:32Z)
- DeiT III: Revenge of the ViT [56.46810490275699]
A Vision Transformer (ViT) is a simple neural architecture amenable to serve several computer vision tasks.
Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BEiT.
arXiv Detail & Related papers (2022-04-14T17:13:44Z)
- ViTGAN: Training GANs with Vision Transformers [46.769407314698434]
Vision Transformers (ViTs) have shown competitive performance on image recognition while requiring less vision-specific inductive biases.
We introduce several novel regularization techniques for training GANs with ViTs.
Our approach, named ViTGAN, achieves comparable performance to the leading CNN-based GAN models on three datasets.
arXiv Detail & Related papers (2021-07-09T17:59:30Z)
- Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
We show that self-supervised ViTs exhibit properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
arXiv Detail & Related papers (2021-04-29T12:28:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.