Adaptive Split-Fusion Transformer
- URL: http://arxiv.org/abs/2204.12196v2
- Date: Wed, 16 Aug 2023 17:09:41 GMT
- Title: Adaptive Split-Fusion Transformer
- Authors: Zixuan Su, Hao Zhang, Jingjing Chen, Lei Pang, Chong-Wah Ngo, Yu-Gang
Jiang
- Abstract summary: We propose an Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights.
Experiments on standard benchmarks such as ImageNet-1K show that ASF-former outperforms its CNN and transformer counterparts, as well as pilot hybrid models, in terms of accuracy.
- Score: 90.04885335911729
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural networks for visual content understanding have recently evolved from
convolutional ones (CNNs) to transformers. The former (CNN) relies on
small-windowed kernels to capture regional clues, demonstrating solid local
expressiveness. In contrast, the latter (transformer) establishes
long-range global connections between localities for holistic learning.
Inspired by this complementary nature, there is a growing interest in designing
hybrid models that best utilize each technique. Current hybrids either use
convolutions merely as simple approximations of linear projection or juxtapose a
convolution branch with attention, without weighing the relative importance of
local and global modeling. To tackle this, we propose a new hybrid named Adaptive
Split-Fusion Transformer (ASF-former), which treats the convolutional and attention
branches differently with adaptive weights. Specifically, an ASF-former encoder
splits the feature channels into two equal halves to feed the dual-path inputs. The
outputs of the two paths are then fused with weighting scalars calculated from
visual cues. We also design a compact convolutional path for efficiency.
Extensive experiments on standard benchmarks, such as ImageNet-1K, CIFAR-10,
and CIFAR-100, show that our ASF-former outperforms its CNN and transformer
counterparts, as well as pilot hybrid models, in terms of accuracy (83.9% on
ImageNet-1K), under comparable conditions (12.9G MACs / 56.7M parameters, without large-scale
pre-training). The code is available at:
https://github.com/szx503045266/ASF-former.
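To make the split-and-fuse mechanism described above concrete, the following is a minimal PyTorch sketch of a dual-path block, assuming a simple design: the feature channels are split into two halves, one half is processed by a compact convolutional path and the other by self-attention, and the two outputs are fused with adaptive scalar weights predicted from pooled visual cues. All module and parameter names here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a split-and-fuse encoder block (illustrative only,
# not the official ASF-former code). Works on NCHW feature maps.
import torch
import torch.nn as nn


class SplitFusionBlock(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % 2 == 0
        half = dim // 2
        # Convolutional path on one half of the channels (compact depthwise + pointwise).
        self.conv_path = nn.Sequential(
            nn.Conv2d(half, half, kernel_size=3, padding=1, groups=half),
            nn.Conv2d(half, half, kernel_size=1),
        )
        # Attention path on the other half of the channels.
        self.attn_path = nn.MultiheadAttention(half, num_heads, batch_first=True)
        # Tiny gate mapping pooled visual cues to two fusion weights.
        self.gate = nn.Sequential(nn.Linear(dim, 2), nn.Softmax(dim=-1))

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        x_conv, x_attn = torch.chunk(x, 2, dim=1)           # split channels in half
        y_conv = self.conv_path(x_conv)                      # local branch
        tokens = x_attn.flatten(2).transpose(1, 2)           # (B, H*W, C/2)
        y_attn, _ = self.attn_path(tokens, tokens, tokens)   # global branch
        y_attn = y_attn.transpose(1, 2).reshape(b, c // 2, h, w)
        # Adaptive fusion weights computed from the global-average-pooled input.
        wts = self.gate(x.mean(dim=(2, 3)))                  # (B, 2)
        w_conv = wts[:, 0].view(-1, 1, 1, 1)
        w_attn = wts[:, 1].view(-1, 1, 1, 1)
        return torch.cat([w_conv * y_conv, w_attn * y_attn], dim=1)


# Example: a 64-channel, 14x14 feature map keeps its shape after the block.
x = torch.randn(2, 64, 14, 14)
print(SplitFusionBlock(64)(x).shape)  # torch.Size([2, 64, 14, 14])
```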
Related papers
- CTRL-F: Pairing Convolution with Transformer for Image Classification via Multi-Level Feature Cross-Attention and Representation Learning Fusion [0.0]
We present a novel lightweight hybrid network that pairs Convolution with Transformers.
We fuse the local responses acquired from the convolution path with the global responses acquired from the MFCA module.
Experiments demonstrate that our variants achieve state-of-the-art performance, whether trained from scratch on large datasets or in a low-data regime.
arXiv Detail & Related papers (2024-07-09T08:47:13Z)
- TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition [71.6546914957701]
We propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way.
We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network.
In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost.
arXiv Detail & Related papers (2023-10-30T09:35:56Z)
- ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation [10.727162449071155]
We build CNN-style Transformers (ConvFormer) to promote better attention convergence and thus better segmentation performance.
In contrast to positional embedding and tokenization, ConvFormer adopts 2D convolution and max-pooling for both position information preservation and feature size reduction.
arXiv Detail & Related papers (2023-09-09T02:18:17Z)
- TEC-Net: Vision Transformer Embrace Convolutional Neural Networks for Medical Image Segmentation [20.976167468217387]
We propose TEC-Net, a vision Transformer that embraces convolutional neural networks for medical image segmentation.
Our network has two advantages. First, dynamic deformable convolution (DDConv) is designed in the CNN branch, which not only overcomes the difficulty of adaptive feature extraction using fixed-size convolution kernels, but also solves the defect that different inputs share the same convolution kernel parameters.
Experimental results show that the proposed TEC-Net provides better medical image segmentation results than SOTA methods including CNN and Transformer networks.
arXiv Detail & Related papers (2023-06-07T01:14:16Z)
- CiT-Net: Convolutional Neural Networks Hand in Hand with Vision Transformers for Medical Image Segmentation [10.20771849219059]
We propose a novel hybrid architecture of convolutional neural networks (CNNs) and vision Transformers (CiT-Net) for medical image segmentation.
Our CiT-Net provides better medical image segmentation results than popular SOTA methods.
arXiv Detail & Related papers (2023-06-06T03:22:22Z)
- Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformer to model the local and global information.
Based on FASA, we develop a family of lightweight vision backbones, Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on empirical combination of self-attention and convolution.
nnFormer achieves substantial improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP [121.35904748477421]
Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision.
Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, have started to lead new trends.
In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons.
arXiv Detail & Related papers (2021-08-30T06:09:02Z)
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z)
- An Attention Free Transformer [22.789683304721276]
We introduce the Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for dot-product self-attention.
In an AFT layer, the key and value are first combined with a set of learned position biases, the result of which is multiplied with the query.
We show that AFT demonstrates competitive performance on all the benchmarks, while providing excellent efficiency at the same time.
arXiv Detail & Related papers (2021-05-28T20:45:30Z)
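As a concrete illustration of the mechanism summarized above, here is a minimal PyTorch sketch of an AFT-style layer, assuming the full (non-local) form: learned pairwise position biases are combined with the keys, the exponentiated result weights the values, and the sigmoid-gated query multiplies the normalized sum element-wise. Shapes and names are illustrative assumptions, not the official implementation.

```python
# Minimal sketch of an attention-free (AFT-style) token mixer. Illustrative only;
# no numerical-stability tricks (e.g. subtracting the max before exp) are applied.
import torch
import torch.nn as nn


class AFTFullSketch(nn.Module):
    def __init__(self, dim, max_len):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # Learned pairwise position biases w[t, t'], shared across channels.
        self.pos_bias = nn.Parameter(torch.zeros(max_len, max_len))

    def forward(self, x):                                  # x: (B, T, D)
        T = x.shape[1]
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        w = self.pos_bias[:T, :T]                          # (T, T)
        # Combine keys with position biases: exp(K[t'] + w[t, t']) for all t, t'.
        weights = torch.exp(k.unsqueeze(1) + w.unsqueeze(0).unsqueeze(-1))  # (B, T, T, D)
        num = (weights * v.unsqueeze(1)).sum(dim=2)        # position-weighted values
        den = weights.sum(dim=2)                           # normalizer
        return torch.sigmoid(q) * num / den                # gate with the query


# Example: 16 tokens of width 32 keep their shape after mixing.
x = torch.randn(2, 16, 32)
print(AFTFullSketch(32, max_len=64)(x).shape)  # torch.Size([2, 16, 32])
```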
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.