Vision Pair Learning: An Efficient Training Framework for Image
Classification
- URL: http://arxiv.org/abs/2112.00965v1
- Date: Thu, 2 Dec 2021 03:45:16 GMT
- Title: Vision Pair Learning: An Efficient Training Framework for Image
Classification
- Authors: Bei Tong and Xiaoyuan Yu
- Abstract summary: Transformer and CNN are complementary in representation learning and convergence speed.
Vision Pair Learning (VPL) builds a network composed of a transformer branch, a CNN branch and a pair learning module.
VPL raises the top-1 accuracy of ViT-Base and ResNet-50 on the ImageNet-1k validation set to 83.47% and 79.61% respectively.
- Score: 0.8223798883838329
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The transformer is a potentially powerful architecture for vision
tasks. Although equipped with more parameters and an attention mechanism, its
performance is currently not as dominant as that of CNNs. CNNs are usually
computationally cheaper and remain the leading competitors in various vision
tasks. One research direction is to adopt the successful ideas of CNNs to
improve transformers, but this often relies on elaborate, heuristic network
design. Observing that the transformer and the CNN are complementary in
representation learning and convergence speed, we propose an efficient
training framework called Vision Pair Learning (VPL) for the image
classification task. VPL builds a network composed of a transformer branch, a
CNN branch and a pair learning module. With a multi-stage training strategy,
VPL enables each branch to learn from its partner during the appropriate
stages of the training process, and makes them both achieve better
performance at a lower time cost. Without external data, VPL raises the top-1
accuracy of ViT-Base and ResNet-50 on the ImageNet-1k validation set to
83.47% and 79.61% respectively. Experiments on datasets from various domains
prove the efficacy of VPL and suggest that the transformer performs better
when paired with a differently structured CNN in VPL. We also analyze the
importance of the components through an ablation study.
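The abstract describes the architecture only at a high level. Below is a minimal, hedged sketch of a VPL-style two-branch setup in PyTorch; it assumes the pair learning module couples the branches through a symmetric, temperature-scaled KL term on their class predictions (a deep-mutual-learning-style assumption) and that the coupling is switched on only in the appropriate training stage. It illustrates the idea and is not the authors' implementation.
```python
# Minimal sketch of a VPL-style two-branch setup. Assumption: the pair learning
# module is modeled here as a symmetric, temperature-scaled KL term on the two
# branches' softened class predictions; the paper's actual module and its
# multi-stage schedule may differ.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, vit_b_16

vit = vit_b_16()   # transformer branch (ViT-Base/16, 1000 ImageNet classes)
cnn = resnet50()   # CNN branch (ResNet-50, 1000 ImageNet classes)

def vpl_step(images, labels, pair_stage=True, alpha=1.0, tau=1.0):
    logits_t, logits_c = vit(images), cnn(images)

    # Each branch keeps its own supervised cross-entropy loss.
    loss = F.cross_entropy(logits_t, labels) + F.cross_entropy(logits_c, labels)

    if pair_stage:  # coupling is enabled only during the appropriate stage
        log_p_t = F.log_softmax(logits_t / tau, dim=1)
        log_p_c = F.log_softmax(logits_c / tau, dim=1)
        # Symmetric coupling: each branch mimics the other's (detached) distribution.
        pair = F.kl_div(log_p_t, log_p_c.exp().detach(), reduction="batchmean") \
             + F.kl_div(log_p_c, log_p_t.exp().detach(), reduction="batchmean")
        loss = loss + alpha * (tau ** 2) * pair

    return loss
```
In a real run, `vpl_step` would be called inside the training loop, with the stage flag driven by the epoch schedule.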
Related papers
- Transfer Learning for Microstructure Segmentation with CS-UNet: A Hybrid
Algorithm with Transformer and CNN Encoders [0.2353157426758003]
We compare the segmentation performance of Transformer and CNN models pre-trained on microscopy images with those pre-trained on natural images.
We also find that for image segmentation, the combination of pre-trained Transformer and CNN encoders is consistently better than pre-trained CNN encoders alone.
arXiv Detail & Related papers (2023-08-26T16:56:15Z)
- How to Train Vision Transformer on Small-scale Datasets? [4.56717163175988]
In contrast to convolutional neural networks, Vision Transformer lacks inherent inductive biases.
We show that self-supervised inductive biases can be learned directly from small-scale datasets.
This allows these models to be trained without large-scale pre-training or changes to the model architecture or loss functions.
arXiv Detail & Related papers (2022-10-13T17:59:19Z)
- ConvFormer: Closing the Gap Between CNN and Vision Transformers [12.793893108426742]
We propose a novel attention mechanism named MCA, which captures different patterns of input images with multiple kernel sizes.
Based on MCA, we present a neural network named ConvFormer.
We show that ConvFormer outperforms similarly sized vision transformers (ViTs) and convolutional neural networks (CNNs) in various tasks (a rough sketch of the multi-kernel idea follows this entry).
arXiv Detail & Related papers (2022-09-16T06:45:01Z)
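The summary only says that MCA mixes information at multiple kernel sizes. The sketch below illustrates that multi-kernel idea with parallel depthwise convolutions and a pointwise fusion; the module name and structure are assumptions for illustration, not ConvFormer's actual MCA design.
```python
# Illustrative multi-kernel mixer: parallel depthwise convolutions with
# different kernel sizes capture patterns at several scales, then a pointwise
# convolution fuses them. The real MCA in ConvFormer may weight or gate the
# branches differently.
import torch
import torch.nn as nn

class MultiKernelMixer(nn.Module):
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.fuse(sum(branch(x) for branch in self.branches))

# shape check: MultiKernelMixer(64)(torch.randn(2, 64, 56, 56)).shape == (2, 64, 56, 56)
```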
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets (a rough sketch of such a CNN-transformer hybrid follows this entry).
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
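The entry above only states that local CNN features are combined with transformer long-range dependencies. Below is a minimal sketch of that generic combination (CNN features, a transformer encoder over the flattened feature map, and a pixel-shuffle upsampling head); the class name, layer sizes and fusion-by-addition are assumptions and do not reproduce the paper's aggregation scheme.
```python
# Generic CNN + transformer hybrid for super-resolution, as an illustration of
# the combination described above: local features from convolutions, global
# context from self-attention over the flattened feature map, then upsampling.
import torch
import torch.nn as nn

class HybridSRBlock(nn.Module):
    def __init__(self, channels=64, scale=2):
        super().__init__()
        self.cnn = nn.Sequential(                      # local feature extraction
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)  # long-range context
        self.upsample = nn.Sequential(                 # pixel-shuffle upsampling head
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr_image):
        feat = self.cnn(lr_image)                      # (B, C, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)       # (B, H*W, C)
        tokens = self.transformer(tokens)
        feat = tokens.transpose(1, 2).reshape(b, c, h, w) + feat  # fuse global and local
        return self.upsample(feat)

# e.g. HybridSRBlock()(torch.randn(1, 3, 24, 24)).shape -> (1, 3, 48, 48)
```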
- Training Vision Transformers with Only 2040 Images [35.86457465241119]
Vision Transformers (ViTs) are emerging as an alternative to convolutional neural networks (CNNs) for visual recognition.
We give theoretical analyses showing that our method is superior to other methods in that it captures both feature alignment and instance similarities (a rough sketch of one such objective follows this entry).
We achieve state-of-the-art results when training from scratch on 7 small datasets under various ViT backbones.
arXiv Detail & Related papers (2022-01-26T03:22:08Z)
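The summary credits the method with capturing feature alignment and instance similarities but does not spell out the objective. One objective commonly analyzed in these terms is parametric instance discrimination, where every training image is treated as its own class; the sketch below shows only that bare idea, under assumed names, and omits the paper's actual training recipe.
```python
# Parametric instance discrimination in isolation: a linear head with one
# "class" per training image, trained with plain cross-entropy over instance
# ids. Augmentation, architecture and schedules from the paper are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceClassifier(nn.Module):
    def __init__(self, backbone, feat_dim, num_images):
        super().__init__()
        self.backbone = backbone                               # any encoder -> (B, feat_dim)
        self.instance_head = nn.Linear(feat_dim, num_images)   # one logit per training image

    def forward(self, images, instance_ids):
        logits = self.instance_head(self.backbone(images))
        return F.cross_entropy(logits, instance_ids)

# toy usage: a flatten "backbone" over 32x32 RGB crops, 2040 instance classes
model = InstanceClassifier(nn.Flatten(), 3 * 32 * 32, num_images=2040)
loss = model(torch.randn(4, 3, 32, 32), torch.tensor([0, 5, 17, 2039]))
```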
- Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs have limitations in visual relation learning because the local receptive field is weak at modeling long-range dependencies.
arXiv Detail & Related papers (2021-06-25T08:04:25Z)
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available, JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z)
- Efficient Training of Visual Transformers with Small-Size Datasets [64.60765211331697]
Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional Networks (CNNs).
We show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different.
We propose a self-supervised task which can extract additional information from images with only a negligible computational overhead.
arXiv Detail & Related papers (2021-06-07T16:14:06Z)
- Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than their CNN counterparts (a rough sketch of a token-labeling loss follows this entry).
arXiv Detail & Related papers (2021-04-22T04:43:06Z)
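Token labeling, as summarized above, supervises every patch token in addition to the class token. The sketch below is a minimal version of such a loss; the per-token soft targets (produced offline by an annotator model in the paper) are assumed to be given, and the weighting factor is a placeholder rather than the paper's setting.
```python
# Minimal token-labeling-style loss: the usual cross-entropy on the class token
# plus a dense soft-label term on every patch token. Per-token targets are
# assumed to be precomputed; beta is a placeholder weight.
import torch
import torch.nn.functional as F

def token_labeling_loss(cls_logits, token_logits, cls_target, token_targets, beta=0.5):
    """
    cls_logits:    (B, C)     class-token predictions
    token_logits:  (B, N, C)  per-patch-token predictions
    cls_target:    (B,)       image-level labels
    token_targets: (B, N, C)  soft per-token label maps (assumed precomputed)
    """
    cls_loss = F.cross_entropy(cls_logits, cls_target)
    token_loss = -(token_targets * F.log_softmax(token_logits, dim=-1)).sum(-1).mean()
    return cls_loss + beta * token_loss
```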
- Curriculum By Smoothing [52.08553521577014]
Convolutional Neural Networks (CNNs) have shown impressive performance in computer vision tasks such as image classification, detection, and segmentation.
We propose an elegant curriculum-based scheme that smooths the feature embeddings of a CNN using anti-aliasing (low-pass) filters (a rough sketch of the smoothing step follows this entry).
As the amount of information in the feature maps increases during training, the network is able to progressively learn better representations of the data.
arXiv Detail & Related papers (2020-03-03T07:27:44Z)
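The curriculum above low-pass filters CNN feature maps early in training and gradually weakens the smoothing. Below is a minimal sketch of that smoothing step with a Gaussian kernel whose width would be annealed by the training schedule; the kernel size and the annealing policy are illustrative assumptions.
```python
# Illustrative feature-map smoothing for a curriculum-by-smoothing-style scheme:
# blur (B, C, H, W) feature maps channel-wise with a Gaussian kernel, and anneal
# sigma toward ~0 over training so high frequencies appear gradually.
import torch
import torch.nn.functional as F

def gaussian_kernel(sigma, size=5):
    coords = torch.arange(size, dtype=torch.float32) - size // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, size, size)

def smooth_features(feat, sigma):
    """Per-channel Gaussian blur of a (B, C, H, W) feature map."""
    c = feat.shape[1]
    k = gaussian_kernel(sigma).to(feat).repeat(c, 1, 1, 1)
    return F.conv2d(feat, k, padding=k.shape[-1] // 2, groups=c)

# e.g. start training with sigma around 1.0 and decay it per epoch; skip the
# blur once sigma is small enough that the kernel is effectively an identity.
```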
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.