Transformers For Recognition In Overhead Imagery: A Reality Check
- URL: http://arxiv.org/abs/2210.12599v1
- Date: Sun, 23 Oct 2022 02:17:31 GMT
- Title: Transformers For Recognition In Overhead Imagery: A Reality Check
- Authors: Francesco Luzi, Aneesh Gupta, Leslie Collins, Kyle Bradbury, Jordan
Malof
- Abstract summary: We compare the impact of adding transformer structures into state-of-the-art segmentation models for overhead imagery.
Our results suggest that transformers provide consistent, but modest, performance improvements.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is evidence that transformers offer state-of-the-art recognition
performance on tasks involving overhead imagery (e.g., satellite imagery).
However, it is difficult to make unbiased empirical comparisons between
competing deep learning models, making it unclear whether, and to what extent,
transformer-based models are beneficial. In this paper we systematically
compare the impact of adding transformer structures into state-of-the-art
segmentation models for overhead imagery. Each model is given a similar budget
of free parameters, and their hyperparameters are optimized using Bayesian
Optimization with a fixed quantity of data and computation time. We conduct our
experiments with a large and diverse dataset comprising two large public
benchmarks: Inria and DeepGlobe. We perform additional ablation studies to
explore the impact of specific transformer-based modeling choices. Our results
suggest that transformers provide consistent, but modest, performance
improvements. We only observe this advantage, however, in hybrid models that
combine convolutional and transformer-based structures, while fully
transformer-based models achieve relatively poor performance.
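The paper releases no code here, but the comparison protocol described above (a similar free-parameter budget per model, hyperparameters tuned by Bayesian Optimization under a fixed data and compute budget, a segmentation quality score such as IoU as the objective) can be sketched roughly as follows. This is a minimal sketch, assuming Optuna's TPE sampler as a stand-in Bayesian optimizer; `train_and_eval_iou`, the search ranges, and the toy score are hypothetical placeholders, not the authors' actual setup.

```python
import math

import optuna


def train_and_eval_iou(lr: float, weight_decay: float, batch_size: int) -> float:
    """Hypothetical stand-in for training one segmentation model (CNN, hybrid,
    or transformer) under a fixed parameter and compute budget and returning
    its validation IoU. The toy score below only makes the sketch runnable;
    replace it with a real training loop on Inria / DeepGlobe."""
    return 0.70 - 0.05 * abs(math.log10(lr) + 3.5) - weight_decay - 0.001 * batch_size


def objective(trial: optuna.Trial) -> float:
    # Typical hyperparameters one might expose for each competing model.
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [4, 8, 16])
    return train_and_eval_iou(lr, weight_decay, batch_size)


# A fixed optimization budget per model keeps the comparison fair.
study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=25)
print("best score:", study.best_value, "best hyperparameters:", study.best_params)
```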
Related papers
- Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms, such as low-rank computation, achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers [7.89533262149443]
Self-attention in Transformers comes with a high computational cost because of its quadratic complexity in the number of input tokens; a minimal sketch of this quadratic scaling appears after this list.
Our benchmark shows that scaling up the model is generally more efficient than using higher-resolution images.
arXiv Detail & Related papers (2023-08-18T08:06:49Z) - Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z) - Patch Similarity Aware Data-Free Quantization for Vision Transformers [2.954890575035673]
We propose PSAQ-ViT, a Patch Similarity Aware data-free Quantization framework for Vision Transformers.
We analyze the self-attention module's properties and reveal a general difference (patch similarity) in its processing of Gaussian noise and real images.
Experiments and ablation studies are conducted on various benchmarks to validate the effectiveness of PSAQ-ViT.
arXiv Detail & Related papers (2022-03-04T11:47:20Z) - AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use.
Our method obtains more than a 2x improvement in efficiency compared to state-of-the-art vision transformers, with only a 0.8% drop in accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z) - Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z) - Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z) - Training Vision Transformers for Image Retrieval [32.09708181236154]
We adopt vision transformers for generating image descriptors and train the resulting model with a metric learning objective.
Our results show consistent and significant improvements of transformers over convolution-based approaches.
arXiv Detail & Related papers (2021-02-10T18:56:41Z)