Bridging the Gap Between Vision Transformers and Convolutional Neural
Networks on Small Datasets
- URL: http://arxiv.org/abs/2210.05958v1
- Date: Wed, 12 Oct 2022 06:54:39 GMT
- Title: Bridging the Gap Between Vision Transformers and Convolutional Neural
Networks on Small Datasets
- Authors: Zhiying Lu, Hongtao Xie, Chuanbin Liu, Yongdong Zhang
- Abstract summary: There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose the Dynamic Hybrid Vision Transformer (DHVT) to strengthen two inductive biases that ViTs lack on small data: spatial relevance and diverse channel representation.
Our DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
- Score: 91.25055890980084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There still remains an extreme performance gap between Vision
Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training
from scratch on small datasets, which is commonly attributed to the lack of
inductive bias. In this paper, we examine this problem further and identify
two weaknesses of ViTs in terms of inductive bias: spatial relevance and
diverse channel representation. First, on the spatial aspect, objects are
locally compact and correlated, so fine-grained features need to be extracted
from a token together with its neighbors; yet the lack of data hinders ViTs
from attending to this spatial relevance. Second, on the channel aspect,
representations vary across channels, but scarce data prevents ViTs from
learning representations strong enough for accurate recognition. To this end,
we propose the Dynamic Hybrid Vision Transformer (DHVT) to strengthen these
two inductive biases. On the spatial aspect, we adopt a hybrid structure in
which convolution is integrated into the patch embedding and the multi-layer
perceptron (MLP) module, forcing the model to capture each token's features
together with those of its neighbors. On the channel aspect, we introduce a
dynamic feature aggregation module in the MLP and a new "head token" design
in the multi-head self-attention module, which re-calibrate the channel
representation and let different channel groups interact with each other. The
fusion of these weak channel representations yields a representation strong
enough for classification. With this design, we eliminate the performance gap
between CNNs and ViTs, and DHVT achieves state-of-the-art results with
lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on
ImageNet-1K with 24.0M parameters. Code is available at
https://github.com/ArieSeirack/DHVT.
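As a rough illustration of the spatial-aspect design described above, the following is a minimal PyTorch sketch, not the authors' released code (see the repository linked above), of a hybrid block in which convolution enters both the patch embedding and the MLP. The class names ConvPatchEmbed, ConvMLP, and HybridBlock and all layer sizes are illustrative assumptions; the channel-aspect components (dynamic feature aggregation and the "head token") are omitted.

```python
import torch
import torch.nn as nn

class ConvPatchEmbed(nn.Module):
    """Patch embedding with an overlapping convolution instead of a single
    non-overlapping projection, so each token also sees its neighbors."""
    def __init__(self, in_chans=3, embed_dim=192, patch_size=4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size + 1,
                      stride=patch_size, padding=patch_size // 2),
            nn.BatchNorm2d(embed_dim),
            nn.GELU(),
        )

    def forward(self, x):                            # x: (B, C, H, W)
        x = self.proj(x)                             # (B, D, H/ps, W/ps)
        B, D, H, W = x.shape
        return x.flatten(2).transpose(1, 2), (H, W)  # tokens (B, N, D), grid

class ConvMLP(nn.Module):
    """MLP with a depth-wise 3x3 convolution between the two linear layers,
    injecting local (neighboring-token) information into the channel mixing."""
    def __init__(self, dim=192, hidden=768):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, hw):                        # x: (B, N, dim) tokens
        H, W = hw
        x = self.fc1(x)                              # (B, N, hidden)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)    # back to a 2-D feature map
        x = self.act(self.dwconv(x))                 # local spatial mixing
        x = x.reshape(B, C, N).transpose(1, 2)       # back to tokens
        return self.fc2(x)

class HybridBlock(nn.Module):
    """One transformer block: standard MHSA plus the conv-augmented MLP."""
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = ConvMLP(dim, dim * 4)

    def forward(self, x, hw):
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x), hw)
        return x

if __name__ == "__main__":
    img = torch.randn(2, 3, 32, 32)                  # e.g. a CIFAR-sized input
    tokens, hw = ConvPatchEmbed()(img)
    out = HybridBlock()(tokens, hw)
    print(out.shape)                                 # torch.Size([2, 64, 192])
```

The depth-wise 3x3 convolution inside the MLP is a common way for hybrid designs to let each token mix with its spatial neighbors at low cost; the actual DHVT modules differ in detail.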
Related papers
- DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization scheme.
arXiv Detail & Related papers (2024-07-18T22:15:35Z)
- TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition [71.6546914957701]
We propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way.
We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network.
In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost.
arXiv Detail & Related papers (2023-10-30T09:35:56Z)
- Plug n' Play: Channel Shuffle Module for Enhancing Tiny Vision Transformers [15.108494142240993]
Vision Transformers (ViTs) have demonstrated remarkable performance in various computer vision tasks.
High computational complexity hinders ViTs' applicability on devices with limited memory and computing resources.
We propose a novel channel shuffle module to improve tiny-size ViTs; a generic sketch of the shuffle operation appears after this list.
arXiv Detail & Related papers (2023-10-09T11:56:35Z)
- Dual Aggregation Transformer for Image Super-Resolution [92.41781921611646]
We propose a novel Transformer model, Dual Aggregation Transformer, for image SR.
Our DAT aggregates features across spatial and channel dimensions in an inter-block and intra-block dual manner.
Our experiments show that our DAT surpasses current methods.
arXiv Detail & Related papers (2023-08-07T07:39:39Z)
- Making Vision Transformers Truly Shift-Equivariant [20.61570323513044]
Vision Transformers (ViTs) have become one of the go-to deep net architectures for computer vision.
We introduce novel data-adaptive designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding.
We evaluate the proposed adaptive models on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2023-05-25T17:59:40Z)
- LightViT: Towards Light-Weight Convolution-Free Vision Transformers [43.48734363817069]
Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural networks (CNNs).
We present LightViT, a new family of light-weight ViTs that achieves a better accuracy-efficiency balance with pure transformer blocks, without convolution.
Experiments show that our model achieves significant improvements on image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2022-07-12T14:27:57Z)
- DaViT: Dual Attention Vision Transformers [94.62855697081079]
We introduce Dual Attention Vision Transformers (DaViT).
DaViT is a vision transformer architecture that is able to capture global context while maintaining computational efficiency.
We show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations.
arXiv Detail & Related papers (2022-04-07T17:59:32Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that Vision Transformer models (ViTs) can achieve performance comparable or even superior to CNNs on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- Volumetric Transformer Networks [88.85542905676712]
We introduce a learnable module, the volumetric transformer network (VTN).
VTN predicts channel-wise warping fields to reconfigure intermediate CNN features spatially and channel-wise.
Our experiments show that VTN consistently boosts the features' representation power and consequently the networks' accuracy on fine-grained image recognition and instance-level image retrieval.
arXiv Detail & Related papers (2020-07-18T14:00:12Z)
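As a generic reference for the channel shuffle idea mentioned in the Plug n' Play entry above, here is a short sketch of the standard reshape-transpose-reshape shuffle applied to ViT token channels. It is an assumption-level illustration, not the module proposed in that paper, and the tensor sizes are arbitrary.

```python
import torch

def channel_shuffle(tokens: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups: (B, N, C) -> (B, N, C) with the
    channels reordered so information can flow between channel groups."""
    B, N, C = tokens.shape
    assert C % groups == 0, "channel count must be divisible by the group count"
    tokens = tokens.reshape(B, N, groups, C // groups)  # split into groups
    tokens = tokens.transpose(2, 3).reshape(B, N, C)    # interleave and flatten
    return tokens

x = torch.randn(2, 196, 192)   # a batch of ViT tokens (hypothetical sizes)
y = channel_shuffle(x, groups=4)
print(y.shape)                 # torch.Size([2, 196, 192])
```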
This list is automatically generated from the titles and abstracts of the papers on this site.