Fcaformer: Forward Cross Attention in Hybrid Vision Transformer
- URL: http://arxiv.org/abs/2211.07198v2
- Date: Mon, 20 Mar 2023 03:43:27 GMT
- Title: Fcaformer: Forward Cross Attention in Hybrid Vision Transformer
- Authors: Haokui Zhang, Wenze Hu, Xiaoyu Wang
- Abstract summary: We propose forward cross attention for hybrid vision transformers (FcaFormer).
Our FcaFormer achieves 83.1% top-1 accuracy on ImageNet with only 16.3 million parameters and about 3.6 billion MACs.
This saves almost half of the parameters and some of the computational cost while achieving 0.7% higher accuracy than the distilled EfficientFormer.
- Score: 29.09883780571206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Currently, one main research line in designing a more efficient vision
transformer is reducing the computational cost of self attention modules by
adopting sparse attention or using local attention windows. In contrast, we
propose a different approach that aims to improve the performance of
transformer-based architectures by densifying the attention pattern.
Specifically, we propose forward cross attention for the hybrid vision
transformer (FcaFormer), where tokens from previous blocks in the same stage
are reused. To achieve this, the FcaFormer leverages two innovative components:
learnable scale factors (LSFs) and a token merge and enhancement module (TME).
The LSFs enable efficient processing of cross tokens, while the TME generates
representative cross tokens. By integrating these components, the proposed
FcaFormer enhances the interactions of tokens across blocks with potentially
different semantics, and encourages more information flows to the lower levels.
Based on the forward cross attention (Fca), we have designed a series of
FcaFormer models that achieve the best trade-off between model size,
computational cost, memory cost, and accuracy. For example, without the need
for knowledge distillation to strengthen training, our FcaFormer achieves 83.1%
top-1 accuracy on ImageNet with only 16.3 million parameters and about 3.6
billion MACs. This saves almost half of the parameters and some of the
computational cost while achieving 0.7% higher accuracy compared to the
distilled EfficientFormer.
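As a rough illustration of the mechanism described in the abstract (a minimal sketch, not the authors' implementation), forward cross attention can be viewed as ordinary scaled dot-product attention whose key/value set is densified with tokens carried forward from earlier blocks in the same stage, each scaled by a learnable scale factor; the token-merge step is approximated here by simple average pooling. All function names and the per-block scalar LSFs are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of d-dim token vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(d)])
    return out

def token_merge(block_tokens, stride=2):
    """Toy stand-in for the TME module: average groups of adjacent tokens to
    produce fewer, representative cross tokens (the real TME also enhances
    them with learned transforms)."""
    merged = []
    for i in range(0, len(block_tokens) - stride + 1, stride):
        group = block_tokens[i:i + stride]
        d = len(group[0])
        merged.append([sum(t[j] for t in group) / len(group) for j in range(d)])
    return merged

def forward_cross_attention(tokens, prev_block_tokens, lsfs):
    """Sketch of Fca: current tokens attend to themselves plus cross tokens
    from earlier blocks in the same stage, each block's contribution scaled
    by a learnable scale factor (modeled here as one float per block)."""
    cross = []
    for scale, block in zip(lsfs, prev_block_tokens):
        cross.extend([[scale * x for x in t] for t in token_merge(block)])
    kv = tokens + cross  # densified key/value set
    return attention(tokens, kv, kv)
```

Because the softmax weights form a convex combination, each output token stays within the range of its key/value tokens; the LSFs let training decide how strongly earlier blocks' tokens should contribute.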
Related papers
- Affine-based Deformable Attention and Selective Fusion for Semi-dense Matching [30.272791354494373]
We introduce affine-based local attention to model cross-view deformations.
We also present selective fusion to merge local and global messages from cross attention.
arXiv Detail & Related papers (2024-05-22T17:57:37Z)
- Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
arXiv Detail & Related papers (2024-04-23T17:26:34Z)
- ParFormer: Vision Transformer Baseline with Parallel Local Global Token Mixer and Convolution Attention Patch Embedding [3.4140488674588614]
ParFormer is an enhanced transformer architecture that allows the incorporation of different token mixers into a single stage.
We offer the Convolutional Attention Patch Embedding (CAPE) as an enhancement of standard patch embedding to improve token mixer extraction.
Our model variants with 11M, 23M, and 34M parameters achieve scores of 80.4%, 82.1%, and 83.1%, respectively.
arXiv Detail & Related papers (2024-03-22T07:32:21Z)
- U-MixFormer: UNet-like Transformer with Mix-Attention for Efficient Semantic Segmentation [0.0]
CNN-based U-Net has seen significant progress in high-resolution medical imaging and remote sensing.
This dual success inspired us to merge the strengths of both, leading to the inception of a U-Net-based vision transformer decoder.
We propose a novel transformer decoder, U-MixFormer, built upon the U-Net structure, designed for efficient semantic segmentation.
arXiv Detail & Related papers (2023-12-11T10:19:42Z)
- SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks.
Much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP).
We show that the dense connections can be replaced with a diagonal block structure that supports larger expansion ratios.
arXiv Detail & Related papers (2023-12-01T08:22:34Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Inception Transformer [151.939077819196]
Inception Transformer, or iFormer, learns comprehensive features with both high- and low-frequency information in visual data.
We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation.
arXiv Detail & Related papers (2022-05-25T17:59:54Z)
- Adaptive Split-Fusion Transformer [90.04885335911729]
We propose an Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights.
Experiments on standard benchmarks, such as ImageNet-1K, show that our ASF-former outperforms its CNN, transformer counterparts, and hybrid pilots in terms of accuracy.
arXiv Detail & Related papers (2022-04-26T10:00:28Z)
- Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA) that allows ViTs to model the attentions at hybrid scales per attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.