ConTNet: Why not use convolution and transformer at the same time?
- URL: http://arxiv.org/abs/2104.13497v1
- Date: Tue, 27 Apr 2021 22:29:55 GMT
- Title: ConTNet: Why not use convolution and transformer at the same time?
- Authors: Haotian Yan, Zhe Li, Weijian Li, Changhu Wang, Ming Wu, Chuang Zhang
- Abstract summary: We propose ConTNet, combining transformer with ConvNet architectures to provide large receptive fields.
We demonstrate its superiority and effectiveness on image classification and downstream tasks.
We hope that ConTNet could serve as a useful backbone for CV tasks and bring new ideas for model design.
- Score: 28.343371000297747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although convolutional networks (ConvNets) have enjoyed great success in
computer vision (CV), they struggle to capture the global information that is crucial to
dense prediction tasks such as object detection and segmentation. In this work,
we propose ConTNet (Convolution-Transformer Network), which combines
transformer with ConvNet architectures to provide large receptive fields.
Unlike recently proposed transformer-based models (e.g., ViT, DeiT), which
are sensitive to hyper-parameters and heavily dependent on extensive data
augmentation when trained from scratch on a midsize dataset (e.g.,
ImageNet-1k), ConTNet can be optimized like a normal ConvNet (e.g., ResNet) and
remains notably robust. It is also worth noting that, under identical strong data
augmentations, ConTNet improves by a larger margin than ResNet. We demonstrate its
superiority and effectiveness on image classification and downstream tasks. For example,
our ConTNet achieves 81.8% top-1 accuracy on ImageNet, matching DeiT-B at less than 40%
of the computational cost. ConTNet-M also outperforms ResNet50 as the backbone of both
Faster R-CNN (by 2.6%) and Mask R-CNN (by 3.2%) on the COCO 2017 dataset. We hope that
ConTNet can serve as a useful backbone for CV tasks and bring new ideas for model design.
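To make the "convolution plus transformer" idea above concrete, here is a minimal PyTorch sketch of a block that interleaves an ordinary convolution with a standard transformer encoder applied within non-overlapping patches. The patch size, channel width, layer ordering, and the class name ConvTransformerBlock are illustrative assumptions, not ConTNet's exact architecture.

```python
# Minimal sketch, not the official ConTNet implementation: an ordinary 3x3
# convolution for local features, followed by a standard transformer encoder
# run over the tokens inside each non-overlapping patch to enlarge the
# receptive field. Patch size, widths, and layout are illustrative assumptions.
import torch
import torch.nn as nn


class ConvTransformerBlock(nn.Module):
    def __init__(self, channels: int, patch_size: int = 7, num_heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.encoder = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads,
            dim_feedforward=4 * channels, batch_first=True,
        )
        self.patch_size = patch_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)
        b, c, h, w = x.shape
        p = self.patch_size  # h and w are assumed divisible by p
        # Cut the feature map into p x p patches and flatten each patch into
        # a sequence of p*p tokens of dimension c.
        x = x.unfold(2, p, p).unfold(3, p, p)              # (b, c, h/p, w/p, p, p)
        hp, wp = x.shape[2], x.shape[3]
        x = x.permute(0, 2, 3, 4, 5, 1).reshape(b * hp * wp, p * p, c)
        x = self.encoder(x)                                 # self-attention per patch
        # Fold the patches back into a feature map of the original size.
        x = x.reshape(b, hp, wp, p, p, c).permute(0, 5, 1, 3, 2, 4)
        return x.reshape(b, c, hp * p, wp * p)


if __name__ == "__main__":
    block = ConvTransformerBlock(channels=64, patch_size=7)
    y = block(torch.randn(2, 64, 56, 56))
    print(y.shape)  # torch.Size([2, 64, 56, 56])
```

Stacking such blocks lets the convolution mix information across neighbouring patches between attention steps, which is one simple way to obtain a large effective receptive field without attending over the whole image at once.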
Related papers
- Learning to Generate Parameters of ConvNets for Unseen Image Data [36.68392191824203]
ConvNets depend heavily on large amounts of image data and resort to an iterative optimization algorithm to learn network parameters.
We propose a new training paradigm and formulate the parameter learning of ConvNets into a prediction task.
We show that our proposed method achieves good efficacy for unseen image datasets on two kinds of settings.
arXiv Detail & Related papers (2023-10-18T10:26:18Z)
- Are Large Kernels Better Teachers than Transformers for ConvNets? [82.4742785108714]
This paper reveals a new appeal of the recently emerged large-kernel Convolutional Neural Networks (ConvNets): serving as the teacher in Knowledge Distillation (KD) for small-kernel ConvNets.
arXiv Detail & Related papers (2023-05-30T21:05:23Z)
- Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement [68.44100784364987]
We propose a strategy to improve a dataset once such that the accuracy of any model architecture trained on the reinforced dataset is improved at no additional training cost for users.
We create a reinforced version of the ImageNet training dataset, called ImageNet+, as well as reinforced datasets CIFAR-100+, Flowers-102+, and Food-101+.
Models trained with ImageNet+ are more accurate, robust, and calibrated, and transfer well to downstream tasks.
arXiv Detail & Related papers (2023-03-15T23:10:17Z)
- MogaNet: Multi-order Gated Aggregation Network [64.16774341908365]
We propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning.
MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module.
MogaNet exhibits great scalability, impressive efficiency of parameters, and competitive performance compared to state-of-the-art ViTs and ConvNets on ImageNet.
arXiv Detail & Related papers (2022-11-07T04:31:17Z)
- Fast-ParC: Capturing Position Aware Global Feature for ConvNets and ViTs [35.39701561076837]
We propose a new basic neural network operator named position-aware circular convolution (ParC) and its accelerated version Fast-ParC.
Our Fast-ParC further reduces the O(n^2) time complexity of ParC to O(n log n) using the Fast Fourier Transform (a short sketch of this idea appears after this list).
Experiment results show that our ParC op can effectively enlarge the receptive field of traditional ConvNets.
arXiv Detail & Related papers (2022-10-08T13:14:02Z)
- EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers [29.09883780571206]
We propose EdgeFormer, a pure ConvNet-based backbone model.
We combine global circular convolution (GCC) with position embeddings to form a light-weight convolution op.
Experiment results show that the proposed EdgeFormer achieves better performance than popular light-weight ConvNets and vision transformer based models.
arXiv Detail & Related papers (2022-03-08T09:25:17Z)
- Bottleneck Transformers for Visual Recognition [97.16013761605254]
We present BoTNet, a powerful backbone architecture that incorporates self-attention for vision tasks.
We present models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark.
We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.
arXiv Detail & Related papers (2021-01-27T18:55:27Z)
- ResNet or DenseNet? Introducing Dense Shortcuts to ResNet [80.35001540483789]
This paper presents a unified perspective of dense summation to analyze them.
We propose dense weighted normalized shortcuts as a solution to the dilemma between ResNet and DenseNet.
Our proposed DSNet achieves significantly better results than ResNet and comparable performance to DenseNet while requiring fewer resources.
arXiv Detail & Related papers (2020-10-23T16:00:15Z)
- DyNet: Dynamic Convolution for Accelerating Convolutional Neural Networks [16.169176006544436]
We propose a novel dynamic convolution method to adaptively generate convolution kernels based on image contents.
Based on the MobileNetV3-Small/Large architectures, DyNet achieves 70.3%/77.1% top-1 accuracy on ImageNet, an improvement of 2.9%/1.9%.
arXiv Detail & Related papers (2020-04-22T16:58:05Z)
- Improved Residual Networks for Image and Video Recognition [98.10703825716142]
Residual networks (ResNets) represent a powerful type of convolutional neural network (CNN) architecture.
We show consistent improvements in accuracy and learning convergence over the baseline.
Our proposed approach allows us to train extremely deep networks, while the baseline shows severe optimization issues.
arXiv Detail & Related papers (2020-04-10T11:09:50Z)
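The Fast-ParC entry above reduces a global (sequence-length) circular convolution from O(n^2) to O(n log n) via the Fast Fourier Transform. Below is a minimal PyTorch sketch of that convolution-theorem trick; the per-channel 1-D layout and the function names are assumptions for illustration, not the paper's actual ParC/Fast-ParC operators.

```python
# Minimal sketch of the FFT trick summarized in the Fast-ParC entry above:
# a circular convolution whose kernel spans the whole sequence, computed
# directly in O(n^2) and via the convolution theorem in O(n log n).
# The per-channel 1-D layout is an illustrative assumption.
import torch


def circular_conv_direct(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Direct circular convolution along the last dimension, O(n^2)."""
    n = x.shape[-1]
    out = torch.zeros_like(x)
    for shift in range(n):
        # y[i] = sum_j w[j] * x[(i - j) mod n]
        out = out + w[..., shift:shift + 1] * torch.roll(x, shifts=shift, dims=-1)
    return out


def circular_conv_fft(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Same circular convolution via the FFT convolution theorem, O(n log n)."""
    n = x.shape[-1]
    return torch.fft.irfft(torch.fft.rfft(x, n=n) * torch.fft.rfft(w, n=n), n=n)


if __name__ == "__main__":
    x = torch.randn(4, 64, 56)   # (batch, channels, sequence length)
    w = torch.randn(64, 56)      # one global kernel per channel
    ref = circular_conv_direct(x, w)
    fast = circular_conv_fft(x, w)
    print(torch.allclose(ref, fast, atol=1e-3))  # True
```

Because the kernel covers the entire sequence, a single such operator already gives every output position a global receptive field, which is the property the Fast-ParC summary highlights for enlarging the receptive field of traditional ConvNets.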