Co-Scale Conv-Attentional Image Transformers
- URL: http://arxiv.org/abs/2104.06399v1
- Date: Tue, 13 Apr 2021 17:58:29 GMT
- Title: Co-Scale Conv-Attentional Image Transformers
- Authors: Weijian Xu, Yifan Xu, Tyler Chang, Zhuowen Tu
- Abstract summary: Co-scale conv-attentional image Transformers (CoaT) are a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms.
On ImageNet, relatively small CoaT models attain superior classification results compared with similar-sized convolutional neural networks and image/vision Transformers.
- Score: 22.834316796018705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present Co-scale conv-attentional image Transformers
(CoaT), a Transformer-based image classifier equipped with co-scale and
conv-attentional mechanisms. First, the co-scale mechanism maintains the
integrity of Transformers' encoder branches at individual scales, while
allowing representations learned at different scales to effectively communicate
with each other; we design a series of serial and parallel blocks to realize
the co-scale attention mechanism. Second, we devise a conv-attentional
mechanism by realizing a relative position embedding formulation in the
factorized attention module with an efficient convolution-like implementation.
CoaT empowers image Transformers with enriched multi-scale and contextual
modeling capabilities. On ImageNet, relatively small CoaT models attain
superior classification results compared with similar-sized convolutional
neural networks and image/vision Transformers. The effectiveness of CoaT's
backbone is also illustrated on object detection and instance segmentation,
demonstrating its applicability to downstream computer vision tasks.
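To make the conv-attentional idea concrete, here is a minimal PyTorch sketch combining the two ingredients the abstract names: factorized attention, which contracts keys and values into a small per-head context matrix so cost grows linearly with token count, and a depthwise convolution over the 2-D token grid serving as a convolution-like relative position encoding. The class name, kernel size, and layout details are illustrative assumptions, not CoaT's exact formulation.

```python
import torch
import torch.nn as nn

class ConvAttention(nn.Module):
    """Sketch of a conv-attentional block in the spirit of CoaT:
    factorized attention (linear in token count) plus a depthwise-
    convolutional relative-position term. Layout is an assumption."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Depthwise conv acts as a convolution-like relative position encoding.
        self.pos_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, h, w):
        # x: (B, N, C) patch tokens on an h x w grid (N == h * w).
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)               # each (B, H, N, d)

        # Factorized attention: softmax over keys, then K^T V is d x d,
        # so cost is linear in N rather than quadratic.
        context = k.softmax(dim=2).transpose(-2, -1) @ v   # (B, H, d, d)
        attn_out = (q @ context) / self.head_dim ** 0.5    # (B, H, N, d)

        # Relative position term: Q modulated by a depthwise convolution
        # of V over the 2-D token grid.
        v_2d = v.transpose(-2, -1).reshape(b, c, h, w)
        pos = self.pos_conv(v_2d).reshape(b, self.num_heads, self.head_dim, n)
        attn_out = attn_out + q * pos.transpose(-2, -1)

        out = attn_out.transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

x = torch.randn(2, 14 * 14, 64)        # 14x14 patch grid, 64 channels
out = ConvAttention(64)(x, 14, 14)     # (2, 196, 64)
```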
Related papers
- A Computationally Efficient Multidimensional Vision Transformer [0.0]
Vision Transformers have achieved state-of-the-art performance in a wide range of computer vision tasks, but their practical deployment is limited by high computational and memory costs.
We introduce a novel tensor-based framework for Vision Transformers built upon the Cosine Product (Cproduct).
arXiv Detail & Related papers (2026-02-23T15:49:46Z)
- Exploring the Integration of Key-Value Attention Into Pure and Hybrid Transformers for Semantic Segmentation [0.0]
KV Transformer shows promising results in synthetic, NLP, and image classification tasks.
This is especially conducive to use cases where local inference is required, such as medical screening applications.
arXiv Detail & Related papers (2025-03-24T16:38:31Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in the low illumination indoor scene.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by Transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
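As a rough illustration of such a CNN-Transformer hybrid (not CSformer's actual layout), the sketch below runs a convolutional branch for local spatial detail alongside a self-attention branch for global context and merges them with a 1x1 convolution; all names and the concatenation-based fusion are assumptions.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Illustrative CNN + Transformer fusion: a convolutional branch keeps
    local spatial detail, a self-attention branch supplies global context,
    and a 1x1 conv merges them. Not CSformer's actual design."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.conv_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        self.fuse = nn.Conv2d(channels * 2, channels, 1)

    def forward(self, x):
        # x: (B, C, H, W) feature map.
        b, c, h, w = x.shape
        local = self.conv_branch(x)
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # (B, H*W, C)
        global_ctx, _ = self.attn(tokens, tokens, tokens)
        global_ctx = global_ctx.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local, global_ctx], dim=1))
```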
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Towards End-to-End Image Compression and Analysis with Transformers [99.50111380056043]
We propose an end-to-end image compression and analysis model with Transformers, targeting cloud-based image classification applications.
We aim to redesign the Vision Transformer (ViT) model to perform image classification from the compressed features and facilitate image compression with the long-term information from the Transformer.
Experimental results demonstrate the effectiveness of the proposed model in both the image compression and the classification tasks.
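The general recipe of classifying from compressed features can be sketched as follows, where a strided convolutional encoder stands in for the learned compression transform and its latent grid is tokenized for a Transformer classifier; all module choices here are assumptions rather than the paper's redesigned ViT.

```python
import torch
import torch.nn as nn

class CompressedFeatureClassifier(nn.Module):
    """Illustrative only: classify from a compact latent instead of pixels.
    A strided conv encoder stands in for a learned compression encoder;
    its latent grid is tokenized and fed to a Transformer encoder."""

    def __init__(self, latent_dim=192, num_classes=1000):
        super().__init__()
        # Stand-in 'analysis transform': 16x spatial downsampling.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=4, padding=2), nn.GELU(),
            nn.Conv2d(64, latent_dim, 5, stride=4, padding=2),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=6, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(latent_dim, num_classes)

    def forward(self, img):
        y = self.encoder(img)                     # (B, D, H/16, W/16) latent
        tokens = y.flatten(2).transpose(1, 2)     # latent positions as tokens
        feats = self.transformer(tokens)
        return self.head(feats.mean(dim=1))       # mean-pool then classify
```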
arXiv Detail & Related papers (2021-12-17T03:28:14Z)
- XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images.
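The channel-wise attention is easy to sketch: per head, queries and keys are L2-normalized along the token axis and combined into a d x d cross-covariance map, which then reweights the value channels. A minimal PyTorch rendering, with the module layout simplified from the paper's description:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XCABlock(nn.Module):
    """Sketch of cross-covariance attention: attention is computed between
    the d x d channel pairs instead of the N x N token pairs, so the cost
    is linear in the number of tokens."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable per-head temperature, as described for XCA.
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, x):
        b, n, c = x.shape
        d = c // self.num_heads
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, d)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)                  # each (B, H, d, N)

        # L2-normalize along the token axis, then form a d x d attention map.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature   # (B, H, d, d)
        out = attn.softmax(dim=-1) @ v                        # (B, H, d, N)

        out = out.permute(0, 3, 1, 2).reshape(b, n, c)
        return self.proj(out)
```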
arXiv Detail & Related papers (2021-06-17T17:33:35Z)
- Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
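A minimal sketch of that simplification is below: token features from the two images are projected and compared by raw dot products, with no softmax weighting and no value aggregation; the max-over-keys reduction and mean pooling used to produce a single matching score are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QKSimilarityDecoder(nn.Module):
    """Sketch of a decoder that keeps only query-key similarity: token
    features of two images are projected and compared by dot product,
    without softmax weighting or value aggregation."""

    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, feats_a, feats_b):
        # feats_a, feats_b: (B, N, C) token features of an image pair.
        q = self.q_proj(feats_a)
        k = self.k_proj(feats_b)
        sim = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, N, N), raw
        # For each query token keep its best-matching key token, then
        # aggregate the per-token matches into one matching score.
        best, _ = sim.max(dim=-1)                # (B, N)
        return best.mean(dim=-1)                 # (B,) image-pair score
```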
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
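Acting on that near-equivalence, one way to "pay less attention" is to replace self-attention with a plain per-token MLP in some blocks. A minimal sketch of such an attention-free block follows; where to place it in the network is an assumption.

```python
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    """An attention-free block: per-token MLP with a residual connection.
    Using such blocks in early stages and reserving self-attention for
    later stages is one reading of the 'less attention' idea."""

    def __init__(self, dim, hidden_ratio=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * hidden_ratio),
            nn.GELU(),
            nn.Linear(dim * hidden_ratio, dim),
        )

    def forward(self, x):            # x: (B, N, C) patch tokens
        return x + self.mlp(self.norm(x))
```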
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Training Vision Transformers for Image Retrieval [32.09708181236154]
We adopt vision transformers for generating image descriptors and train the resulting model with a metric learning objective.
Our results show consistent and significant improvements of transformers over convolution-based approaches.
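The training recipe reduces to: use a ViT's pooled output as the image descriptor and optimize a metric objective on image tuples. A minimal sketch using timm's ViT and a standard triplet margin loss, both illustrative choices rather than the paper's exact setup:

```python
import torch
import torch.nn as nn
import timm  # assumed available; any ViT feature extractor would do

# ViT as a descriptor model: num_classes=0 yields pooled features.
model = timm.create_model("vit_small_patch16_224", num_classes=0)
criterion = nn.TripletMarginLoss(margin=0.2)   # margin is an assumption
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def embed(x):
    # L2-normalized ViT output serves as the image descriptor.
    return nn.functional.normalize(model(x), dim=-1)

def train_step(anchor, positive, negative):
    """One metric-learning step on an (anchor, positive, negative)
    triplet batch of images, each of shape (B, 3, 224, 224)."""
    loss = criterion(embed(anchor), embed(positive), embed(negative))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```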
arXiv Detail & Related papers (2021-02-10T18:56:41Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
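The key plumbing in such a design is converting the ViT's token sequence back into a spatial feature map a detection head can consume. A minimal sketch, where dropping the class token and the grid size are assumptions:

```python
import torch

def tokens_to_feature_map(tokens, grid_h, grid_w):
    """Reshape ViT patch tokens (B, 1 + H*W, C), with a leading class
    token, into a (B, C, H, W) feature map for a detection head such
    as Faster R-CNN's."""
    patch_tokens = tokens[:, 1:, :]                    # drop [CLS]
    b, n, c = patch_tokens.shape
    assert n == grid_h * grid_w, "token count must match the patch grid"
    return patch_tokens.transpose(1, 2).reshape(b, c, grid_h, grid_w)

# Example: a 224x224 input with 16x16 patches gives a 14x14 grid.
tokens = torch.randn(2, 1 + 14 * 14, 768)
fmap = tokens_to_feature_map(tokens, 14, 14)           # (2, 768, 14, 14)
```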
arXiv Detail & Related papers (2020-12-17T22:33:14Z)