Adaptive Split-Fusion Transformer
- URL: http://arxiv.org/abs/2204.12196v2
- Date: Wed, 16 Aug 2023 17:09:41 GMT
- Title: Adaptive Split-Fusion Transformer
- Authors: Zixuan Su, Hao Zhang, Jingjing Chen, Lei Pang, Chong-Wah Ngo, Yu-Gang
Jiang
- Abstract summary: We propose an Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights.
Experiments on standard benchmarks such as ImageNet-1K show that ASF-former outperforms its CNN and transformer counterparts, as well as pilot hybrid models, in terms of accuracy.
- Score: 90.04885335911729
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural networks for visual content understanding have recently evolved from
convolutional ones (CNNs) to transformers. The former (CNN) relies on
small-windowed kernels to capture regional clues, demonstrating solid local
expressiveness. In contrast, the latter (transformer) establishes
long-range global connections between localities for holistic learning.
Inspired by this complementary nature, there is a growing interest in designing
hybrid models that best utilize each technique. Current hybrids either use
convolutions merely as simple approximations of linear projection or juxtapose a
convolution branch with attention, without weighing the relative importance of
local and global modeling. To tackle this, we propose a new hybrid named Adaptive
Split-Fusion Transformer (ASF-former), which treats the convolutional and attention
branches differently with adaptive weights. Specifically, an ASF-former encoder
splits the feature channels into two equal halves to feed the dual-path inputs. The
outputs of the two paths are then fused with weighting scalars calculated from
visual cues. We also design a compact convolutional path for efficiency.
Extensive experiments on standard benchmarks, such as ImageNet-1K, CIFAR-10,
and CIFAR-100, show that our ASF-former outperforms its CNN and transformer
counterparts, as well as pilot hybrid models, in terms of accuracy (83.9% on
ImageNet-1K), under comparable conditions (12.9G MACs / 56.7M parameters, without large-scale
pre-training). The code is available at:
https://github.com/szx503045266/ASF-former.
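To make the split-and-fuse mechanism described above concrete, the following is a minimal PyTorch sketch of a dual-path block, assuming a simple design: the feature channels are split into two halves, one half is processed by a compact convolutional path and the other by self-attention, and the two outputs are fused with adaptive scalar weights predicted from pooled visual cues. All module and parameter names here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a split-and-fuse encoder block (illustrative only,
# not the official ASF-former code). Works on NCHW feature maps.
import torch
import torch.nn as nn


class SplitFusionBlock(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % 2 == 0
        half = dim // 2
        # Convolutional path on one half of the channels (compact depthwise + pointwise).
        self.conv_path = nn.Sequential(
            nn.Conv2d(half, half, kernel_size=3, padding=1, groups=half),
            nn.Conv2d(half, half, kernel_size=1),
        )
        # Attention path on the other half of the channels.
        self.attn_path = nn.MultiheadAttention(half, num_heads, batch_first=True)
        # Tiny gate mapping pooled visual cues to two fusion weights.
        self.gate = nn.Sequential(nn.Linear(dim, 2), nn.Softmax(dim=-1))

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        x_conv, x_attn = torch.chunk(x, 2, dim=1)           # split channels in half
        y_conv = self.conv_path(x_conv)                      # local branch
        tokens = x_attn.flatten(2).transpose(1, 2)           # (B, H*W, C/2)
        y_attn, _ = self.attn_path(tokens, tokens, tokens)   # global branch
        y_attn = y_attn.transpose(1, 2).reshape(b, c // 2, h, w)
        # Adaptive fusion weights computed from the global-average-pooled input.
        wts = self.gate(x.mean(dim=(2, 3)))                  # (B, 2)
        w_conv = wts[:, 0].view(-1, 1, 1, 1)
        w_attn = wts[:, 1].view(-1, 1, 1, 1)
        return torch.cat([w_conv * y_conv, w_attn * y_attn], dim=1)


# Example: a 64-channel, 14x14 feature map keeps its shape after the block.
x = torch.randn(2, 64, 14, 14)
print(SplitFusionBlock(64)(x).shape)  # torch.Size([2, 64, 14, 14])
```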
Related papers
- CTRL-F: Pairing Convolution with Transformer for Image Classification via Multi-Level Feature Cross-Attention and Representation Learning Fusion [0.0]
We present a novel lightweight hybrid network that pairs Convolution with Transformers.
We fuse the local responses acquired from the convolution path with the global responses acquired from the MFCA module.
Experiments demonstrate that our variants achieve state-of-the-art performance, whether trained from scratch on large datasets or in a low-data regime.
arXiv Detail & Related papers (2024-07-09T08:47:13Z)
- TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition [71.6546914957701]
We propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way.
We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network.
In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost.
arXiv Detail & Related papers (2023-10-30T09:35:56Z)
- ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation [10.727162449071155]
We build CNN-style Transformers (ConvFormer) to promote better attention convergence and thus better segmentation performance.
In contrast to positional embedding and tokenization, ConvFormer adopts 2D convolution and max-pooling for both position information preservation and feature size reduction.
arXiv Detail & Related papers (2023-09-09T02:18:17Z)
- TEC-Net: Vision Transformer Embrace Convolutional Neural Networks for Medical Image Segmentation [20.976167468217387]
We propose TEC-Net, a vision Transformer that embraces convolutional neural networks for medical image segmentation.
Our network has two advantages. First, dynamic deformable convolution (DDConv) is designed in the CNN branch, which not only overcomes the difficulty of adaptive feature extraction using fixed-size convolution kernels, but also solves the defect that different inputs share the same convolution kernel parameters.
Experimental results show that the proposed TEC-Net provides better medical image segmentation results than SOTA methods including CNN and Transformer networks.
arXiv Detail & Related papers (2023-06-07T01:14:16Z)
- CiT-Net: Convolutional Neural Networks Hand in Hand with Vision Transformers for Medical Image Segmentation [10.20771849219059]
We propose a novel hybrid architecture of convolutional neural networks (CNNs) and vision Transformers (CiT-Net) for medical image segmentation.
Our CiT-Net provides better medical image segmentation results than popular SOTA methods.
arXiv Detail & Related papers (2023-06-06T03:22:22Z)
- Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformer to model the local and global information.
Based on FASA, we develop a family of lightweight vision backbones, Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on empirical combination of self-attention and convolution.
nnFormer achieves substantial improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP [121.35904748477421]
Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision.
Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, have started to lead new trends.
In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons.
arXiv Detail & Related papers (2021-08-30T06:09:02Z)
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z)
- An Attention Free Transformer [22.789683304721276]
We introduce the Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for dot-product self-attention.
In an AFT layer, the key and value are first combined with a set of learned position biases, the result of which is multiplied with the query.
We show that AFT demonstrates competitive performance on all the benchmarks, while providing excellent efficiency at the same time.
arXiv Detail & Related papers (2021-05-28T20:45:30Z)
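As a concrete illustration of the mechanism summarized above, here is a minimal PyTorch sketch of an AFT-style layer, assuming the full (non-local) form: learned pairwise position biases are combined with the keys, the exponentiated result weights the values, and the sigmoid-gated query multiplies the normalized sum element-wise. Shapes and names are illustrative assumptions, not the official implementation.

```python
# Minimal sketch of an attention-free (AFT-style) token mixer. Illustrative only;
# no numerical-stability tricks (e.g. subtracting the max before exp) are applied.
import torch
import torch.nn as nn


class AFTFullSketch(nn.Module):
    def __init__(self, dim, max_len):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # Learned pairwise position biases w[t, t'], shared across channels.
        self.pos_bias = nn.Parameter(torch.zeros(max_len, max_len))

    def forward(self, x):                                  # x: (B, T, D)
        T = x.shape[1]
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        w = self.pos_bias[:T, :T]                          # (T, T)
        # Combine keys with position biases: exp(K[t'] + w[t, t']) for all t, t'.
        weights = torch.exp(k.unsqueeze(1) + w.unsqueeze(0).unsqueeze(-1))  # (B, T, T, D)
        num = (weights * v.unsqueeze(1)).sum(dim=2)        # position-weighted values
        den = weights.sum(dim=2)                           # normalizer
        return torch.sigmoid(q) * num / den                # gate with the query


# Example: 16 tokens of width 32 keep their shape after mixing.
x = torch.randn(2, 16, 32)
print(AFTFullSketch(32, max_len=64)(x).shape)  # torch.Size([2, 16, 32])
```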
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.