InternImage: Exploring Large-Scale Vision Foundation Models with
Deformable Convolutions
- URL: http://arxiv.org/abs/2211.05778v4
- Date: Mon, 17 Apr 2023 11:51:12 GMT
- Title: InternImage: Exploring Large-Scale Vision Foundation Models with
Deformable Convolutions
- Authors: Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou
Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao
- Abstract summary: This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs.
The proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs.
- Score: 95.94629864981091
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compared to the great progress of large-scale vision transformers (ViTs) in
recent years, large-scale models based on convolutional neural networks (CNNs)
are still in an early state. This work presents a new large-scale CNN-based
foundation model, termed InternImage, which can obtain the gain from increasing
parameters and training data like ViTs. Different from the recent CNNs that
focus on large dense kernels, InternImage takes deformable convolution as the
core operator, so that our model not only has the large effective receptive
field required for downstream tasks such as detection and segmentation, but
also has the adaptive spatial aggregation conditioned by input and task
information. As a result, the proposed InternImage reduces the strict inductive
bias of traditional CNNs and makes it possible to learn stronger and more
robust patterns with large-scale parameters from massive data like ViTs. The
effectiveness of our model is proven on challenging benchmarks including
ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieved
a new record 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming
current leading CNNs and ViTs. The code will be released at
https://github.com/OpenGVLab/InternImage.
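The core idea can be illustrated with a modulated deformable convolution, where both the sampling offsets and the aggregation weights are predicted from the input itself. The sketch below uses torchvision's DCNv2-style operator as a stand-in; it is not the paper's DCNv3 block, whose actual implementation lives in the repository linked above.

```python
# Minimal sketch of input-conditioned ("adaptive") spatial aggregation with a
# modulated deformable convolution. This is a DCNv2-style stand-in, NOT the
# DCNv3 operator used by InternImage; see the official repo for that.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d


class DeformableBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.padding = kernel_size // 2
        n_taps = kernel_size * kernel_size
        # Offsets (where to sample) and modulation scalars (how much each
        # sample contributes) are predicted from the input feature map.
        self.offset_conv = nn.Conv2d(channels, 2 * n_taps, kernel_size,
                                     padding=self.padding)
        self.mask_conv = nn.Conv2d(channels, n_taps, kernel_size,
                                   padding=self.padding)
        self.weight = nn.Parameter(
            0.01 * torch.randn(channels, channels, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset_conv(x)
        mask = torch.sigmoid(self.mask_conv(x))
        return deform_conv2d(x, offset, self.weight, self.bias,
                             padding=self.padding, mask=mask)


if __name__ == "__main__":
    block = DeformableBlock(64)
    print(block(torch.randn(2, 64, 56, 56)).shape)  # torch.Size([2, 64, 56, 56])
```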
Related papers
- OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation [70.17681136234202]
We reexamine the design distinctions and test the limits of what a sparse CNN can achieve.
We propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap.
This exploration led to Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates these adaptive designs through a lightweight module.
arXiv Detail & Related papers (2024-03-21T14:06:38Z) - Lightweight Real-time Semantic Segmentation Network with Efficient
Transformer and CNN [34.020978009518245]
We propose a lightweight real-time semantic segmentation network called LETNet.
LETNet combines a U-shaped CNN with a Transformer in a capsule embedding style, so that each compensates for the other's deficiencies.
Experiments on challenging datasets demonstrate that LETNet achieves a superior balance of accuracy and efficiency.
arXiv Detail & Related papers (2023-02-21T07:16:53Z) - ConvFormer: Closing the Gap Between CNN and Vision Transformers [12.793893108426742]
We propose a novel attention mechanism named MCA, which captures different patterns in input images using multiple kernel sizes.
Based on MCA, we present a neural network named ConvFormer.
We show that ConvFormer outperforms similarly sized vision transformers (ViTs) and convolutional neural networks (CNNs) on various tasks.
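The summary does not spell out MCA's exact structure; a minimal, assumed sketch of the multi-kernel idea (parallel depthwise convolutions of different sizes, fused by a pointwise projection) is:

```python
# Hypothetical sketch of mixing multiple kernel sizes; the module name and
# layout are assumptions, not the paper's actual MCA design.
import torch.nn as nn


class MultiKernelMixer(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes])
        self.proj = nn.Conv2d(channels, channels, 1)  # pointwise fusion

    def forward(self, x):
        # Each depthwise branch responds to patterns at a different scale.
        return self.proj(sum(branch(x) for branch in self.branches))
```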
arXiv Detail & Related papers (2022-09-16T06:45:01Z) - EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for
Mobile Vision Applications [68.35683849098105]
We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
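Only the channel-splitting aspect of SDTA is described in the summary; a rough sketch of that idea (grouped depthwise convolutions with information cascaded between groups) follows, with the details being assumptions rather than the paper's actual encoder:

```python
# Rough sketch of splitting channels into groups, each processed by its own
# depthwise convolution with information cascaded between groups. The
# transpose-attention part of SDTA is omitted; specifics are assumptions.
import torch
import torch.nn as nn


class ChannelSplitDW(nn.Module):
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.chunk = channels // groups
        self.convs = nn.ModuleList([
            nn.Conv2d(self.chunk, self.chunk, 3, padding=1, groups=self.chunk)
            for _ in range(groups)])

    def forward(self, x):
        outs, carry = [], 0
        for part, conv in zip(torch.split(x, self.chunk, dim=1), self.convs):
            carry = conv(part + carry)  # pass information to the next group
            outs.append(carry)
        return torch.cat(outs, dim=1)
```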
arXiv Detail & Related papers (2022-06-21T17:59:56Z) - Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs [148.0476219278875]
We revisit large kernel design in modern convolutional neural networks (CNNs).
Inspired by recent advances in vision transformers (ViTs), we demonstrate that using a few large convolutional kernels instead of a stack of small kernels can be a more powerful paradigm.
We propose RepLKNet, a pure CNN architecture whose kernel size is as large as 31x31, in contrast to commonly used 3x3.
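A minimal sketch of the large-kernel idea, assuming a depthwise 31x31 kernel trained alongside a small parallel kernel (the re-parameterization described by RepLKNet is only noted in a comment):

```python
# Sketch of a large-kernel depthwise convolution in the spirit of RepLKNet.
# The exact block layout is an assumption; see the RepLKNet code for details.
import torch.nn as nn


class LargeKernelDW(nn.Module):
    def __init__(self, channels: int, big: int = 31, small: int = 5):
        super().__init__()
        self.big = nn.Conv2d(channels, channels, big,
                             padding=big // 2, groups=channels)
        self.small = nn.Conv2d(channels, channels, small,
                               padding=small // 2, groups=channels)

    def forward(self, x):
        # Both branches are used during training; after training the small
        # kernel can be zero-padded and folded into the large one.
        return self.big(x) + self.small(x)
```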
arXiv Detail & Related papers (2022-03-13T17:22:44Z) - Training Vision Transformers with Only 2040 Images [35.86457465241119]
Vision Transformers (ViTs) are emerging as an alternative to convolutional neural networks (CNNs) for visual recognition.
We provide theoretical analyses showing that our method is superior to other approaches in that it captures both feature alignment and instance similarities.
We achieve state-of-the-art results when training from scratch on 7 small datasets under various ViT backbones.
arXiv Detail & Related papers (2022-01-26T03:22:08Z) - Efficient Training of Visual Transformers with Small-Size Datasets [64.60765211331697]
Visual Transformers (VTs) are emerging as an architectural alternative to convolutional networks (CNNs).
We show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different.
We propose a self-supervised task which can extract additional information from images with only a negligible computational overhead.
arXiv Detail & Related papers (2021-06-07T16:14:06Z) - Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with
56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models achieve better results than their CNN counterparts.
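Token labeling supervises every patch token with its own soft target in addition to the usual image-level loss on the class token; a hedged sketch of such an objective, with names and the loss weight as assumptions:

```python
# Sketch of a token-labeling style objective: image-level cross-entropy on the
# class token plus a soft cross-entropy on every patch token. The weighting
# and tensor layout are illustrative assumptions.
import torch.nn.functional as F


def token_labeling_loss(cls_logits, token_logits, image_label,
                        token_soft_labels, token_weight: float = 0.5):
    # cls_logits: (B, C), token_logits: (B, N, C),
    # image_label: (B,), token_soft_labels: (B, N, C) soft targets per token.
    cls_loss = F.cross_entropy(cls_logits, image_label)
    token_log_probs = F.log_softmax(token_logits, dim=-1)
    token_loss = -(token_soft_labels * token_log_probs).sum(-1).mean()
    return cls_loss + token_weight * token_loss
```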
arXiv Detail & Related papers (2021-04-22T04:43:06Z) - On the Performance of Convolutional Neural Networks under High and Low
Frequency Information [13.778851745408133]
We study the performance of CNN models on the high- and low-frequency components of images.
We propose a filtering-based data augmentation applied during training.
A satisfactory improvement in robustness and low-frequency generalization is observed.
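One way to realize a filtering-based augmentation is to mask the Fourier spectrum of an image; the sketch below keeps only low frequencies, with the cutoff radius as an illustrative assumption rather than the paper's setting:

```python
# Sketch of a low-pass filtering augmentation via Fourier-domain masking.
# The cutoff radius (and when to apply the filter) are assumptions.
import torch


def low_pass(img: torch.Tensor, radius: int = 16) -> torch.Tensor:
    """img: (C, H, W); keep frequencies within `radius` of the spectrum center."""
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    _, h, w = img.shape
    yy, xx = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    keep = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2) <= radius ** 2
    spec = torch.where(keep, spec, torch.zeros_like(spec))
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real
```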
arXiv Detail & Related papers (2020-10-30T17:54:45Z)