FcaFormer: Forward Cross Attention in Hybrid Vision Transformer
- URL: http://arxiv.org/abs/2211.07198v2
- Date: Mon, 20 Mar 2023 03:43:27 GMT
- Title: FcaFormer: Forward Cross Attention in Hybrid Vision Transformer
- Authors: Haokui Zhang, Wenze Hu, Xiaoyu Wang
- Abstract summary: We propose forward cross attention for hybrid vision transformer (FcaFormer).
Our FcaFormer achieves 83.1% top-1 accuracy on ImageNet with only 16.3 million parameters and about 3.6 billion MACs.
This saves almost half of the parameters and some of the computational cost while achieving 0.7% higher accuracy compared to the distilled EfficientFormer.
- Score: 29.09883780571206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Currently, one main research line in designing a more efficient vision
transformer is reducing the computational cost of self attention modules by
adopting sparse attention or using local attention windows. In contrast, we
propose a different approach that aims to improve the performance of
transformer-based architectures by densifying the attention pattern.
Specifically, we propose forward cross attention for the hybrid vision transformer
(FcaFormer), where tokens from previous blocks in the same stage are reused as
secondary cross tokens. To achieve this, the FcaFormer leverages two innovative components:
learnable scale factors (LSFs) and a token merge and enhancement module (TME).
The LSFs enable efficient processing of cross tokens, while the TME generates
representative cross tokens. By integrating these components, the proposed
FcaFormer enhances the interactions of tokens across blocks with potentially
different semantics, and encourages more information to flow to the lower levels.
Based on the forward cross attention (Fca), we have designed a series of
FcaFormer models that achieve the best trade-off between model size,
computational cost, memory cost, and accuracy. For example, without the need
for knowledge distillation to strengthen training, our FcaFormer achieves 83.1%
top-1 accuracy on ImageNet with only 16.3 million parameters and about 3.6
billion MACs. This saves almost half of the parameters and some of the
computational cost while achieving 0.7% higher accuracy compared to the
distilled EfficientFormer.
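To make the mechanism concrete, below is a minimal PyTorch-style sketch of how forward cross attention with learnable scale factors might look, based only on the abstract: queries come from the current block's tokens, while keys and values also include tokens forwarded from earlier blocks in the same stage, each rescaled by its own LSF. The module and argument names are hypothetical, and the token merge and enhancement (TME) module is omitted; this is an illustrative reading, not the authors' implementation.

```python
# Hedged sketch of forward cross attention (Fca); names are assumptions.
import torch
import torch.nn as nn


class ForwardCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, max_cross_sources: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One learnable scale factor (LSF) per previous block whose tokens are reused.
        self.lsf = nn.Parameter(torch.ones(max_cross_sources))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, cross_tokens: list) -> torch.Tensor:
        # x:            (B, N, C) tokens entering the current block
        # cross_tokens: tokens saved from previous blocks in the same stage
        xn = self.norm(x)
        scaled = [self.lsf[i] * t for i, t in enumerate(cross_tokens)]
        kv = torch.cat([xn] + scaled, dim=1)  # densify: attend to own + forwarded tokens
        out, _ = self.attn(xn, kv, kv)        # queries stay on the current tokens only
        return x + out


# Usage: within one stage, each block also consumes the tokens of all previous blocks.
blocks = nn.ModuleList(ForwardCrossAttention(192, 6, max_cross_sources=3) for _ in range(4))
x, history = torch.randn(2, 196, 192), []
for blk in blocks:
    x = blk(x, history)
    history.append(x)
print(x.shape)  # torch.Size([2, 196, 192])
```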
Related papers
- CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new decoupled dual-interactive linear attention (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z) - CTRL-F: Pairing Convolution with Transformer for Image Classification via Multi-Level Feature Cross-Attention and Representation Learning Fusion [0.0]
We present a novel lightweight hybrid network that pairs Convolution with Transformers.
We fuse the local responses acquired from the convolution path with the global responses acquired from the MFCA module.
Experiments demonstrate that our variants achieve state-of-the-art performance, whether trained from scratch on large-scale data or in a low-data regime.
arXiv Detail & Related papers (2024-07-09T08:47:13Z) - Affine-based Deformable Attention and Selective Fusion for Semi-dense Matching [30.272791354494373]
We introduce affine-based local attention to model cross-view deformations.
We also present selective fusion to merge local and global messages from cross attention.
arXiv Detail & Related papers (2024-05-22T17:57:37Z) - Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
arXiv Detail & Related papers (2024-04-23T17:26:34Z) - U-MixFormer: UNet-like Transformer with Mix-Attention for Efficient Semantic Segmentation [0.0]
CNN-based U-Net has seen significant progress in high-resolution medical imaging and remote sensing.
This dual success inspired us to merge the strengths of both, leading to the inception of a U-Net-based vision transformer decoder.
We propose a novel transformer decoder, U-MixFormer, built upon the U-Net structure, designed for efficient semantic segmentation.
arXiv Detail & Related papers (2023-12-11T10:19:42Z) - SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks.
Much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP).
We show that the dense connections can be replaced with a diagonal block structure that supports larger expansion ratios.
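As a rough illustration of the block-diagonal idea summarized above, the sketch below replaces a dense FFN with grouped 1x1 projections so that a larger expansion ratio does not blow up the parameter count. The class name and grouping choices are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: block-diagonal (grouped) channel mixer in place of a dense FFN.
import torch
import torch.nn as nn


class BlockDiagonalFFN(nn.Module):
    def __init__(self, dim: int, expansion: int = 8, groups: int = 4):
        super().__init__()
        # Grouped 1x1 convolutions realise a block-diagonal weight matrix: each
        # channel group is mixed only within itself, so the expansion ratio can
        # grow without the quadratic parameter cost of a dense FFN.
        self.fc1 = nn.Conv1d(dim, dim * expansion, kernel_size=1, groups=groups)
        self.act = nn.GELU()
        self.fc2 = nn.Conv1d(dim * expansion, dim, kernel_size=1, groups=groups)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) token sequence; Conv1d expects (B, C, N).
        y = x.transpose(1, 2)
        y = self.fc2(self.act(self.fc1(y)))
        return x + y.transpose(1, 2)


ffn = BlockDiagonalFFN(dim=256)
print(ffn(torch.randn(2, 196, 256)).shape)  # torch.Size([2, 196, 256])
```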
arXiv Detail & Related papers (2023-12-01T08:22:34Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - Adaptive Split-Fusion Transformer [90.04885335911729]
We propose an Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights.
Experiments on standard benchmarks, such as ImageNet-1K, show that our ASF-former outperforms its CNN and transformer counterparts, as well as hybrid pilots, in terms of accuracy.
arXiv Detail & Related papers (2022-04-26T10:00:28Z) - Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA) that allows ViTs to model the attentions at hybrid scales per attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
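For the shunted self-attention entry above, the following hedged sketch shows one way "attentions at hybrid scales per attention layer" could be realised: head groups attend to keys/values pooled at different spatial rates. The pooling operator and class names are my assumptions, not the published SSA code.

```python
# Hedged sketch of multi-scale key/value aggregation in the spirit of SSA.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShuntedAttentionSketch(nn.Module):
    """Head groups attend to keys/values pooled at different spatial rates."""

    def __init__(self, dim: int, num_heads: int = 4, rates=(1, 2)):
        super().__init__()
        assert num_heads % len(rates) == 0 and dim % len(rates) == 0
        self.rates = rates
        self.group_dim = dim // len(rates)
        heads_per_group = num_heads // len(rates)
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(self.group_dim, heads_per_group, batch_first=True)
            for _ in rates
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, hw: tuple) -> torch.Tensor:
        # x: (B, N, C) tokens laid out on an H x W grid, so N == H * W.
        B, N, C = x.shape
        H, W = hw
        outs = []
        for i, r in enumerate(self.rates):
            xi = x[..., i * self.group_dim:(i + 1) * self.group_dim]
            kv = xi
            if r > 1:
                # Average-pool tokens on the 2D grid to shrink this group's key/value set.
                grid = xi.transpose(1, 2).reshape(B, self.group_dim, H, W)
                kv = F.avg_pool2d(grid, r).flatten(2).transpose(1, 2)
            out, _ = self.attn[i](xi, kv, kv)  # queries keep full resolution
            outs.append(out)
        return self.proj(torch.cat(outs, dim=-1))


m = ShuntedAttentionSketch(dim=128)
print(m(torch.randn(2, 196, 128), (14, 14)).shape)  # torch.Size([2, 196, 128])
```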