FoveaTer: Foveated Transformer for Image Classification
- URL: http://arxiv.org/abs/2105.14173v1
- Date: Sat, 29 May 2021 01:54:33 GMT
- Title: FoveaTer: Foveated Transformer for Image Classification
- Authors: Aditya Jonnalagadda, William Wang, Miguel P. Eckstein
- Abstract summary: We propose the foveated Transformer (FoveaTer) model, which uses pooling regions and saccadic movements to perform object classification tasks.
We construct an ensemble model using our proposed model and an unfoveated model, achieving an accuracy 1.36% below the unfoveated model with 22% computational savings.
- Score: 8.207403859762044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many animals and humans process the visual field with a varying spatial
resolution (foveated vision) and use peripheral processing to make eye
movements and point the fovea to acquire high-resolution information about
objects of interest. This architecture results in computationally efficient
rapid scene exploration. Recent progress in vision Transformers has brought
about new alternatives to the traditionally convolution-reliant computer vision
systems. However, these models do not explicitly model the foveated properties
of the visual system nor the interaction between eye movements and the
classification task. We propose the foveated Transformer (FoveaTer) model, which
uses pooling regions and saccadic movements to perform object classification
tasks using a vision Transformer architecture. Our proposed model pools the
image features using squared pooling regions, an approximation to the
biologically-inspired foveated architecture, and uses the pooled features as an
input to a Transformer Network. It decides on the following fixation location
based on the attention assigned by the Transformer to various locations from
previous and present fixations. The model uses a confidence threshold to stop
scene exploration, allowing it to dynamically allocate more fixation/computational
resources to more challenging images. We construct an ensemble model using our
proposed model and an unfoveated model, achieving an accuracy 1.36% below the
unfoveated model with 22% computational savings. Finally, we demonstrate our
model's robustness against adversarial attacks, where it outperforms the
unfoveated model.
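The fixation loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `block` and `fovea_radius` parameters, the summing of attention maps across fixations, and the `model` callable (assumed to return class probabilities and an attention map over feature-map locations) are all hypothetical choices made for illustration.

```python
import numpy as np


def foveated_pool(features, fixation, block=2, fovea_radius=2):
    """Pool an (H, W, C) feature map into tokens with square regions.

    Blocks close to the fixation keep every cell as its own token;
    all other blocks are averaged into a single token.
    (Hypothetical simplification of foveated pooling.)
    """
    h, w, c = features.shape
    fy, fx = fixation
    tokens = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            patch = features[y:y + block, x:x + block]
            # Chebyshev distance from the fixation to the nearest cell of the block.
            dy = max(y - fy, 0, fy - (y + block - 1))
            dx = max(x - fx, 0, fx - (x + block - 1))
            if max(dy, dx) < fovea_radius:
                tokens.extend(patch.reshape(-1, c))      # full resolution near the fovea
            else:
                tokens.append(patch.mean(axis=(0, 1)))   # one pooled token in the periphery
    return np.stack(tokens)


def classify_with_fixations(features, model, start=(0, 0),
                            threshold=0.9, max_fixations=5):
    """Explore the image until the prediction is confident enough.

    `model(tokens, fixation)` is an assumed interface returning
    (class_probabilities, attention_map of shape (H, W)).
    """
    h, w, _ = features.shape
    attention_sum = np.zeros((h, w))
    visited = set()
    fixation = start
    for step in range(1, max_fixations + 1):
        visited.add(fixation)
        tokens = foveated_pool(features, fixation)
        probs, attention = model(tokens, fixation)
        # Assumption: attention from previous and present fixations is summed.
        attention_sum += attention
        if probs.max() >= threshold:
            break  # confident enough, stop exploring
        # Next fixation: the most attended location not yet visited.
        masked = attention_sum.copy()
        for vy, vx in visited:
            masked[vy, vx] = -np.inf
        idx = np.unravel_index(np.argmax(masked), masked.shape)
        fixation = (int(idx[0]), int(idx[1]))
    return int(probs.argmax()), step
```

In this sketch, images on which the model is confident after the first fixation cost a single forward pass over a reduced token set, while harder images receive up to `max_fixations` passes, which is the dynamic allocation of computation the abstract refers to.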
Related papers
- Transformers and Slot Encoding for Sample Efficient Physical World Modelling [1.5498250598583487]
We propose an architecture combining Transformers for world modelling with the slot-attention paradigm, an approach for learning representations of objects appearing in a scene.
We describe the resulting neural architecture and report experimental results showing an improvement over the existing solutions in terms of sample efficiency and a reduction of the variation of the performance over the training examples.
arXiv Detail & Related papers (2024-05-30T15:48:04Z)
- Transformers For Recognition In Overhead Imagery: A Reality Check [0.0]
We compare the impact of adding transformer structures into state-of-the-art segmentation models for overhead imagery.
Our results suggest that transformers provide consistent, but modest, performance improvements.
arXiv Detail & Related papers (2022-10-23T02:17:31Z)
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- Optimizing Relevance Maps of Vision Transformers Improves Robustness [91.61353418331244]
It has been observed that visual classification models often rely mostly on the image background, neglecting the foreground, which hurts their robustness to distribution changes.
We propose to monitor the model's relevancy signal and manipulate it such that the model is focused on the foreground object.
This is done as a finetuning step, involving relatively few samples consisting of pairs of images and their associated foreground masks.
arXiv Detail & Related papers (2022-06-02T17:24:48Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in the low illumination indoor scene.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Swin-Pose: Swin Transformer Based Human Pose Estimation [16.247836509380026]
Convolutional neural networks (CNNs) have been widely utilized in many computer vision tasks.
CNNs have a fixed receptive field and lack the ability of long-range perception, which is crucial to human pose estimation.
We propose a novel model based on transformer architecture, enhanced with a feature pyramid fusion structure.
arXiv Detail & Related papers (2022-01-19T02:15:26Z)
- PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object result.
Applied to a recent transformer-based image recognition model, the method shows consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from the 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Generative Adversarial Transformers [13.633811200719627]
We introduce the GANsformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling.
The network employs a bipartite structure that enables long-range interactions across the image, while maintaining computation of linear efficiency.
We show it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data-efficiency.
arXiv Detail & Related papers (2021-03-01T18:54:04Z)