ConTNet: Why not use convolution and transformer at the same time?
- URL: http://arxiv.org/abs/2104.13497v1
- Date: Tue, 27 Apr 2021 22:29:55 GMT
- Title: ConTNet: Why not use convolution and transformer at the same time?
- Authors: Haotian Yan, Zhe Li, Weijian Li, Changhu Wang, Ming Wu, Chuang Zhang
- Abstract summary: We propose ConTNet, combining transformer with ConvNet architectures to provide large receptive fields.
We demonstrate its superiority and effectiveness on image classification and downstream tasks.
We hope that ConTNet could serve as a useful backbone for CV tasks and bring new ideas for model design.
- Score: 28.343371000297747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although convolutional networks (ConvNets) have enjoyed great success in
computer vision (CV), they struggle to capture the global information that is crucial to
dense prediction tasks such as object detection and segmentation. In this work,
we propose ConTNet (Convolution-Transformer Network), which combines
transformer with ConvNet architectures to provide large receptive fields.
Unlike recently proposed transformer-based models (e.g., ViT, DeiT), which
are sensitive to hyper-parameters and heavily dependent on extensive data
augmentation when trained from scratch on a midsize dataset (e.g.,
ImageNet-1k), ConTNet can be optimized like a normal ConvNet (e.g., ResNet) and
remains notably robust. It is also worth noting that, under identical strong data
augmentations, ConTNet improves by a larger margin than ResNet. We demonstrate its
superiority and effectiveness on image classification and downstream tasks. For example,
our ConTNet achieves 81.8% top-1 accuracy on ImageNet, matching DeiT-B at less than 40%
of the computational cost. ConTNet-M also outperforms ResNet50 as the backbone of both
Faster R-CNN (by 2.6%) and Mask R-CNN (by 3.2%) on the COCO 2017 dataset. We hope that
ConTNet can serve as a useful backbone for CV tasks and bring new ideas for model design.
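To make the "convolution plus transformer" idea above concrete, here is a minimal PyTorch sketch of a block that interleaves an ordinary convolution with a standard transformer encoder applied within non-overlapping patches. The patch size, channel width, layer ordering, and the class name ConvTransformerBlock are illustrative assumptions, not ConTNet's exact architecture.

```python
# Minimal sketch, not the official ConTNet implementation: an ordinary 3x3
# convolution for local features, followed by a standard transformer encoder
# run over the tokens inside each non-overlapping patch to enlarge the
# receptive field. Patch size, widths, and layout are illustrative assumptions.
import torch
import torch.nn as nn


class ConvTransformerBlock(nn.Module):
    def __init__(self, channels: int, patch_size: int = 7, num_heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.encoder = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads,
            dim_feedforward=4 * channels, batch_first=True,
        )
        self.patch_size = patch_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)
        b, c, h, w = x.shape
        p = self.patch_size  # h and w are assumed divisible by p
        # Cut the feature map into p x p patches and flatten each patch into
        # a sequence of p*p tokens of dimension c.
        x = x.unfold(2, p, p).unfold(3, p, p)              # (b, c, h/p, w/p, p, p)
        hp, wp = x.shape[2], x.shape[3]
        x = x.permute(0, 2, 3, 4, 5, 1).reshape(b * hp * wp, p * p, c)
        x = self.encoder(x)                                 # self-attention per patch
        # Fold the patches back into a feature map of the original size.
        x = x.reshape(b, hp, wp, p, p, c).permute(0, 5, 1, 3, 2, 4)
        return x.reshape(b, c, hp * p, wp * p)


if __name__ == "__main__":
    block = ConvTransformerBlock(channels=64, patch_size=7)
    y = block(torch.randn(2, 64, 56, 56))
    print(y.shape)  # torch.Size([2, 64, 56, 56])
```

Stacking such blocks lets the convolution mix information across neighbouring patches between attention steps, which is one simple way to obtain a large effective receptive field without attending over the whole image at once.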
Related papers
- Learning to Generate Parameters of ConvNets for Unseen Image Data [36.68392191824203]
ConvNets depend heavily on large amounts of image data and resort to an iterative optimization algorithm to learn network parameters.
We propose a new training paradigm and formulate the parameter learning of ConvNets into a prediction task.
We show that our proposed method achieves good efficacy for unseen image datasets on two kinds of settings.
arXiv Detail & Related papers (2023-10-18T10:26:18Z)
- Are Large Kernels Better Teachers than Transformers for ConvNets? [82.4742785108714]
This paper reveals a new appeal of the recently emerged large-kernel Convolutional Neural Networks (ConvNets): serving as the teacher in Knowledge Distillation (KD) for small-kernel ConvNets.
arXiv Detail & Related papers (2023-05-30T21:05:23Z)
- Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement [68.44100784364987]
We propose a strategy to improve a dataset once such that the accuracy of any model architecture trained on the reinforced dataset is improved at no additional training cost for users.
We create a reinforced version of the ImageNet training dataset, called ImageNet+, as well as reinforced datasets CIFAR-100+, Flowers-102+, and Food-101+.
Models trained with ImageNet+ are more accurate, robust, and calibrated, and transfer well to downstream tasks.
arXiv Detail & Related papers (2023-03-15T23:10:17Z)
- MogaNet: Multi-order Gated Aggregation Network [64.16774341908365]
We propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning.
MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module.
MogaNet exhibits great scalability, impressive efficiency of parameters, and competitive performance compared to state-of-the-art ViTs and ConvNets on ImageNet.
arXiv Detail & Related papers (2022-11-07T04:31:17Z)
- Fast-ParC: Capturing Position Aware Global Feature for ConvNets and ViTs [35.39701561076837]
We propose a new basic neural network operator named position-aware circular convolution (ParC) and its accelerated version Fast-ParC.
Our Fast-ParC further reduces the O(n^2) time complexity of ParC to O(n log n) using the Fast Fourier Transform (a short sketch of this idea appears after this list).
Experiment results show that our ParC op can effectively enlarge the receptive field of traditional ConvNets.
arXiv Detail & Related papers (2022-10-08T13:14:02Z)
- EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers [29.09883780571206]
We propose EdgeFormer, a pure ConvNet-based backbone model.
We combine global circular convolution (GCC) with position embeddings to form a light-weight convolution op.
Experiment results show that the proposed EdgeFormer achieves better performance than popular light-weight ConvNets and vision transformer based models.
arXiv Detail & Related papers (2022-03-08T09:25:17Z)
- Bottleneck Transformers for Visual Recognition [97.16013761605254]
We present BoTNet, a powerful backbone architecture that incorporates self-attention for vision tasks.
We present models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark.
We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.
arXiv Detail & Related papers (2021-01-27T18:55:27Z)
- ResNet or DenseNet? Introducing Dense Shortcuts to ResNet [80.35001540483789]
This paper presents a unified perspective of dense summation to analyze them.
We propose dense weighted normalized shortcuts as a solution to the dilemma between ResNet and DenseNet.
Our proposed DSNet achieves significantly better results than ResNet and comparable performance to DenseNet while requiring fewer resources.
arXiv Detail & Related papers (2020-10-23T16:00:15Z)
- DyNet: Dynamic Convolution for Accelerating Convolutional Neural Networks [16.169176006544436]
We propose a novel dynamic convolution method to adaptively generate convolution kernels based on image contents.
Based on the MobileNetV3-Small/Large architectures, DyNet achieves 70.3%/77.1% top-1 accuracy on ImageNet, an improvement of 2.9%/1.9%.
arXiv Detail & Related papers (2020-04-22T16:58:05Z)
- Improved Residual Networks for Image and Video Recognition [98.10703825716142]
Residual networks (ResNets) represent a powerful type of convolutional neural network (CNN) architecture.
We show consistent improvements in accuracy and learning convergence over the baseline.
Our proposed approach allows us to train extremely deep networks, while the baseline shows severe optimization issues.
arXiv Detail & Related papers (2020-04-10T11:09:50Z)
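The Fast-ParC entry above reduces a global (sequence-length) circular convolution from O(n^2) to O(n log n) via the Fast Fourier Transform. Below is a minimal PyTorch sketch of that convolution-theorem trick; the per-channel 1-D layout and the function names are assumptions for illustration, not the paper's actual ParC/Fast-ParC operators.

```python
# Minimal sketch of the FFT trick summarized in the Fast-ParC entry above:
# a circular convolution whose kernel spans the whole sequence, computed
# directly in O(n^2) and via the convolution theorem in O(n log n).
# The per-channel 1-D layout is an illustrative assumption.
import torch


def circular_conv_direct(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Direct circular convolution along the last dimension, O(n^2)."""
    n = x.shape[-1]
    out = torch.zeros_like(x)
    for shift in range(n):
        # y[i] = sum_j w[j] * x[(i - j) mod n]
        out = out + w[..., shift:shift + 1] * torch.roll(x, shifts=shift, dims=-1)
    return out


def circular_conv_fft(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Same circular convolution via the FFT convolution theorem, O(n log n)."""
    n = x.shape[-1]
    return torch.fft.irfft(torch.fft.rfft(x, n=n) * torch.fft.rfft(w, n=n), n=n)


if __name__ == "__main__":
    x = torch.randn(4, 64, 56)   # (batch, channels, sequence length)
    w = torch.randn(64, 56)      # one global kernel per channel
    ref = circular_conv_direct(x, w)
    fast = circular_conv_fft(x, w)
    print(torch.allclose(ref, fast, atol=1e-3))  # True
```

Because the kernel covers the entire sequence, a single such operator already gives every output position a global receptive field, which is the property the Fast-ParC summary highlights for enlarging the receptive field of traditional ConvNets.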