TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition
- URL: http://arxiv.org/abs/2310.19380v2
- Date: Thu, 30 Nov 2023 01:48:03 GMT
- Title: TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition
- Authors: Meng Lou, Hong-Yu Zhou, Sibei Yang, Yizhou Yu
- Abstract summary: We propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way.
We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network.
In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost.
- Score: 71.6546914957701
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have integrated convolution into transformers to introduce
inductive bias and improve generalization performance. However, the static
nature of conventional convolution prevents it from dynamically adapting to
input variations, resulting in a representation discrepancy between convolution
and self-attention as self-attention calculates attention matrices dynamically.
Furthermore, when stacking token mixers that consist of convolution and
self-attention to form a deep network, the static nature of convolution hinders
the fusion of features previously generated by self-attention into convolution
kernels. These two limitations result in a sub-optimal representation capacity
of the constructed networks. To find a solution, we propose a lightweight Dual
Dynamic Token Mixer (D-Mixer) that aggregates global information and local
details in an input-dependent way. D-Mixer works by applying an efficient
global attention module and an input-dependent depthwise convolution separately
on evenly split feature segments, endowing the network with strong inductive
bias and an enlarged effective receptive field. We use D-Mixer as the basic
building block to design TransXNet, a novel hybrid CNN-Transformer vision
backbone network that delivers compelling performance. In the ImageNet-1K image
classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy
while requiring less than half of the computational cost. Furthermore,
TransXNet-S and TransXNet-B exhibit excellent model scalability, achieving
top-1 accuracy of 83.8% and 84.6% respectively, with reasonable computational
costs. Additionally, our proposed network architecture demonstrates strong
generalization capabilities in various dense prediction tasks, outperforming
other state-of-the-art networks while having lower computational costs. Code is
available at https://github.com/LMMMEng/TransXNet.
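The abstract's description of D-Mixer suggests a concrete structure: split the channels evenly, run one half through a global attention module and the other half through a depthwise convolution whose kernel is generated from the input, then concatenate the two halves. The PyTorch sketch below illustrates only that channel-split idea; the module name, the plain multi-head attention standing in for the paper's efficient global attention, and the pooled kernel generator are all illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch of the channel-split idea behind D-Mixer, as described in
# the abstract. Names and internals are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualDynamicMixerSketch(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 7, num_heads: int = 4):
        super().__init__()
        assert dim % 2 == 0, "channels are split evenly between the branches"
        self.half = dim // 2          # dim/2 must also divide by num_heads
        self.kernel_size = kernel_size
        # Global branch: plain multi-head self-attention stands in for the
        # paper's efficient global attention module.
        self.attn = nn.MultiheadAttention(self.half, num_heads,
                                          batch_first=True)
        # Local branch: a small generator predicts a depthwise kernel from
        # the input itself, making the convolution input-dependent.
        self.kernel_gen = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(self.half, self.half * kernel_size * kernel_size, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); split channels evenly between the two branches.
        b, c, h, w = x.shape
        xg, xl = torch.split(x, self.half, dim=1)

        # Global branch: flatten spatial dims into tokens and self-attend.
        tokens = xg.flatten(2).transpose(1, 2)        # (B, H*W, C/2)
        tokens, _ = self.attn(tokens, tokens, tokens)
        xg = tokens.transpose(1, 2).reshape(b, self.half, h, w)

        # Local branch: generate one depthwise kernel per sample, then apply
        # it via grouped conv with the batch folded into the channel dim.
        k = self.kernel_gen(xl).reshape(b * self.half, 1,
                                        self.kernel_size, self.kernel_size)
        xl = F.conv2d(xl.reshape(1, b * self.half, h, w), k,
                      padding=self.kernel_size // 2, groups=b * self.half)
        xl = xl.reshape(b, self.half, h, w)

        # Concatenate the global and local halves back together.
        return torch.cat([xg, xl], dim=1)
```

As a quick shape check, DualDynamicMixerSketch(64)(torch.randn(2, 64, 14, 14)) returns a tensor of the same (2, 64, 14, 14) shape, with the first 32 channels mixed globally and the last 32 mixed by per-sample depthwise kernels.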
Related papers
- Convolution and Attention Mixer for Synthetic Aperture Radar Image
Change Detection [41.38587746899477]
Synthetic aperture radar (SAR) image change detection is a critical task and has received increasing attention in the remote sensing community.
Existing SAR change detection methods are mainly based on convolutional neural networks (CNNs).
We propose a convolution and attention mixer (CAMixer) to incorporate global attention.
arXiv Detail & Related papers (2023-09-21T12:28:23Z)
- Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose the Dynamic Hybrid Vision Transformer (DHVT) as a solution to enhance the two inductive biases.
DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z)
- Adaptive Split-Fusion Transformer [90.04885335911729]
We propose an Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights.
Experiments on standard benchmarks such as ImageNet-1K show that our ASF-former outperforms its CNN and transformer counterparts, as well as hybrid pilots, in terms of accuracy.
arXiv Detail & Related papers (2022-04-26T10:00:28Z)
- XnODR and XnIDR: Two Accurate and Fast Fully Connected Layers For Convolutional Neural Networks [43.85390451313721]
Capsule Networks are powerful at modeling the positional relationships between features in deep neural networks for visual recognition tasks.
Their bottleneck is the computational complexity of the Dynamic Routing mechanism used between capsules.
XnODR and XnIDR help networks achieve high accuracy with lower FLOPs and fewer parameters.
arXiv Detail & Related papers (2021-11-21T16:42:01Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We introduce a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs with diverse difficulty levels (see the sketch after this list).
We present the dynamic slimmable network (DS-Net) and the dynamic slice-able network (DS-Net++), which input-dependently adjust the number of filters in CNNs and multiple dimensions in both CNNs and transformers.
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- X-volution: On the unification of convolution and self-attention [52.80459687846842]
We propose a multi-branch elementary module composed of both convolution and self-attention operation.
The proposed X-volution achieves highly competitive improvements on visual understanding tasks.
arXiv Detail & Related papers (2021-06-04T04:32:02Z)
- MUXConv: Information Multiplexing in Convolutional Neural Networks [25.284420772533572]
MUXConv is designed to increase the flow of information by progressively multiplexing channel and spatial information in the network.
On ImageNet, the resulting models, dubbed MUXNets, match the performance (75.3% top-1 accuracy) and multiply-add operations (218M) of MobileNetV3.
MUXNet also performs well under transfer learning and when adapted to object detection.
arXiv Detail & Related papers (2020-03-31T00:09:47Z)
- ReActNet: Towards Precise Binary Neural Network with Generalized Activation Functions [76.05981545084738]
We propose several ideas for enhancing a binary network to close its accuracy gap with real-valued networks without incurring any additional computational cost.
We first construct a baseline network by modifying and binarizing a compact real-valued network with parameter-free shortcuts.
We show that the proposed ReActNet outperforms all state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2020-03-07T02:12:02Z)
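For the dynamic weight slicing entry above, here is a minimal sketch of the core idea: a lightweight gate scores a few candidate widths, and the convolution then uses only the first k filters of its weight tensor, so easier inputs run a thinner layer. The class name, the hard-argmax gate, and batch-level (rather than per-sample) routing are simplifying assumptions, not the actual DS-Net/DS-Net++ routing scheme.

```python
# Minimal sketch of dynamic weight slicing: a gate picks a width ratio and
# the conv uses only the first k filters. Names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlicedConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, ratios=(0.25, 0.5, 1.0)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.ratios = ratios
        # Gate: global pooling plus a linear layer scoring each width.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, len(ratios)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pick one width for the whole batch (per-sample routing would need
        # grouping samples by their chosen width).
        idx = self.gate(x).mean(0).argmax().item()
        k = max(1, int(self.conv.out_channels * self.ratios[idx]))
        # Slice the first k filters: easy inputs run a thinner layer.
        return F.conv2d(x, self.conv.weight[:k], self.conv.bias[:k],
                        padding=1)
```

Note that the output channel count varies with the gate's choice, so downstream layers must tolerate variable widths; DS-Net addresses this with slimmable layers, which this sketch omits.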