Bridging the Gap Between Vision Transformers and Convolutional Neural
Networks on Small Datasets
- URL: http://arxiv.org/abs/2210.05958v1
- Date: Wed, 12 Oct 2022 06:54:39 GMT
- Title: Bridging the Gap Between Vision Transformers and Convolutional Neural
Networks on Small Datasets
- Authors: Zhiying Lu, Hongtao Xie, Chuanbin Liu, Yongdong Zhang
- Abstract summary: There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose the Dynamic Hybrid Vision Transformer (DHVT) to strengthen two inductive biases that ViTs lack on small data: spatial relevance and diverse channel representation.
Our DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
- Score: 91.25055890980084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There still remains an extreme performance gap between Vision
Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training
from scratch on small datasets, which is commonly attributed to the lack of
inductive bias. In this paper, we examine this problem further and identify
two weaknesses of ViTs in terms of inductive bias: spatial relevance and
diverse channel representation. First, on the spatial aspect, objects are
locally compact and correlated, so fine-grained features need to be extracted
from a token together with its neighbors; yet the lack of data hinders ViTs
from attending to this spatial relevance. Second, on the channel aspect,
representations vary across channels, but scarce data prevents ViTs from
learning representations strong enough for accurate recognition. To this end,
we propose the Dynamic Hybrid Vision Transformer (DHVT) to strengthen these
two inductive biases. On the spatial aspect, we adopt a hybrid structure in
which convolution is integrated into the patch embedding and the multi-layer
perceptron (MLP) module, forcing the model to capture each token's features
together with those of its neighbors. On the channel aspect, we introduce a
dynamic feature aggregation module in the MLP and a new "head token" design
in the multi-head self-attention module, which re-calibrate the channel
representation and let different channel groups interact with each other. The
fusion of these weak channel representations yields a representation strong
enough for classification. With this design, we eliminate the performance gap
between CNNs and ViTs, and DHVT achieves state-of-the-art results with
lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on
ImageNet-1K with 24.0M parameters. Code is available at
https://github.com/ArieSeirack/DHVT.
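As a rough illustration of the spatial-aspect design described above, the following is a minimal PyTorch sketch, not the authors' released code (see the repository linked above), of a hybrid block in which convolution enters both the patch embedding and the MLP. The class names ConvPatchEmbed, ConvMLP, and HybridBlock and all layer sizes are illustrative assumptions; the channel-aspect components (dynamic feature aggregation and the "head token") are omitted.

```python
import torch
import torch.nn as nn

class ConvPatchEmbed(nn.Module):
    """Patch embedding with an overlapping convolution instead of a single
    non-overlapping projection, so each token also sees its neighbors."""
    def __init__(self, in_chans=3, embed_dim=192, patch_size=4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size + 1,
                      stride=patch_size, padding=patch_size // 2),
            nn.BatchNorm2d(embed_dim),
            nn.GELU(),
        )

    def forward(self, x):                            # x: (B, C, H, W)
        x = self.proj(x)                             # (B, D, H/ps, W/ps)
        B, D, H, W = x.shape
        return x.flatten(2).transpose(1, 2), (H, W)  # tokens (B, N, D), grid

class ConvMLP(nn.Module):
    """MLP with a depth-wise 3x3 convolution between the two linear layers,
    injecting local (neighboring-token) information into the channel mixing."""
    def __init__(self, dim=192, hidden=768):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, hw):                        # x: (B, N, dim) tokens
        H, W = hw
        x = self.fc1(x)                              # (B, N, hidden)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)    # back to a 2-D feature map
        x = self.act(self.dwconv(x))                 # local spatial mixing
        x = x.reshape(B, C, N).transpose(1, 2)       # back to tokens
        return self.fc2(x)

class HybridBlock(nn.Module):
    """One transformer block: standard MHSA plus the conv-augmented MLP."""
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = ConvMLP(dim, dim * 4)

    def forward(self, x, hw):
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x), hw)
        return x

if __name__ == "__main__":
    img = torch.randn(2, 3, 32, 32)                  # e.g. a CIFAR-sized input
    tokens, hw = ConvPatchEmbed()(img)
    out = HybridBlock()(tokens, hw)
    print(out.shape)                                 # torch.Size([2, 64, 192])
```

The depth-wise 3x3 convolution inside the MLP is a common way for hybrid designs to let each token mix with its spatial neighbors at low cost; the actual DHVT modules differ in detail.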
Related papers
- DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization scheme.
arXiv Detail & Related papers (2024-07-18T22:15:35Z)
- TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition [71.6546914957701]
We propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way.
We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network.
In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost.
arXiv Detail & Related papers (2023-10-30T09:35:56Z)
- Plug n' Play: Channel Shuffle Module for Enhancing Tiny Vision Transformers [15.108494142240993]
Vision Transformers (ViTs) have demonstrated remarkable performance in various computer vision tasks.
High computational complexity hinders ViTs' applicability on devices with limited memory and computing resources.
We propose a novel channel shuffle module to improve tiny-size ViTs; a generic sketch of the shuffle operation appears after this list.
arXiv Detail & Related papers (2023-10-09T11:56:35Z)
- Dual Aggregation Transformer for Image Super-Resolution [92.41781921611646]
We propose a novel Transformer model, Dual Aggregation Transformer, for image SR.
Our DAT aggregates features across spatial and channel dimensions in an inter-block and intra-block dual manner.
Our experiments show that our DAT surpasses current methods.
arXiv Detail & Related papers (2023-08-07T07:39:39Z)
- Making Vision Transformers Truly Shift-Equivariant [20.61570323513044]
Vision Transformers (ViTs) have become one of the go-to deep net architectures for computer vision.
We introduce novel data-adaptive designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding.
We evaluate the proposed adaptive models on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2023-05-25T17:59:40Z)
- LightViT: Towards Light-Weight Convolution-Free Vision Transformers [43.48734363817069]
Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural networks (CNNs).
We present LightViT, a new family of light-weight ViTs that achieves a better accuracy-efficiency balance with pure transformer blocks, without convolution.
Experiments show that our model achieves significant improvements on image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2022-07-12T14:27:57Z)
- DaViT: Dual Attention Vision Transformers [94.62855697081079]
We introduce Dual Attention Vision Transformers (DaViT).
DaViT is a vision transformer architecture that is able to capture global context while maintaining computational efficiency.
We show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations.
arXiv Detail & Related papers (2022-04-07T17:59:32Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that Vision Transformer models (ViTs) can achieve performance comparable or even superior to CNNs on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- Volumetric Transformer Networks [88.85542905676712]
We introduce a learnable module, the volumetric transformer network (VTN).
VTN predicts channel-wise warping fields to reconfigure intermediate CNN features spatially and channel-wise.
Our experiments show that VTN consistently boosts the features' representation power and consequently the networks' accuracy on fine-grained image recognition and instance-level image retrieval.
arXiv Detail & Related papers (2020-07-18T14:00:12Z)
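As a generic reference for the channel shuffle idea mentioned in the Plug n' Play entry above, here is a short sketch of the standard reshape-transpose-reshape shuffle applied to ViT token channels. It is an assumption-level illustration, not the module proposed in that paper, and the tensor sizes are arbitrary.

```python
import torch

def channel_shuffle(tokens: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups: (B, N, C) -> (B, N, C) with the
    channels reordered so information can flow between channel groups."""
    B, N, C = tokens.shape
    assert C % groups == 0, "channel count must be divisible by the group count"
    tokens = tokens.reshape(B, N, groups, C // groups)  # split into groups
    tokens = tokens.transpose(2, 3).reshape(B, N, C)    # interleave and flatten
    return tokens

x = torch.randn(2, 196, 192)   # a batch of ViT tokens (hypothetical sizes)
y = channel_shuffle(x, groups=4)
print(y.shape)                 # torch.Size([2, 196, 192])
```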
This list is automatically generated from the titles and abstracts of the papers on this site.