The Counterattack of CNNs in Self-Supervised Learning: Larger Kernel
Size might be All You Need
- URL: http://arxiv.org/abs/2312.05695v2
- Date: Tue, 12 Dec 2023 18:23:42 GMT
- Title: The Counterattack of CNNs in Self-Supervised Learning: Larger Kernel
Size might be All You Need
- Authors: Tianjin Huang, Tianlong Chen, Zhangyang Wang and Shiwei Liu
- Abstract summary: Vision Transformers have been rapidly rising in computer vision thanks to their outstanding scaling trends, and are gradually replacing convolutional neural networks (CNNs).
Recent works on self-supervised learning (SSL) introduce siamese pre-training tasks.
People have come to believe that Transformers or self-attention modules are inherently more suitable than CNNs in the context of SSL.
- Score: 103.31261028244782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers have been rapidly rising in computer vision thanks to
their outstanding scaling trends, and are gradually replacing convolutional neural
networks (CNNs). Recent works on self-supervised learning (SSL) introduce
siamese pre-training tasks, on which Transformer backbones continue to
demonstrate ever-stronger results than CNNs. People have come to believe that
Transformers or self-attention modules are inherently more suitable than CNNs
in the context of SSL. However, it is noteworthy that most, if not all, prior
work on SSL with CNNs chose standard ResNets as backbones, whose
architectural effectiveness is known to lag behind that of advanced Vision
Transformers. Therefore, it remains unclear whether the self-attention
operation is crucial for the recent advances in SSL, or whether CNNs with more
advanced designs can deliver the same excellence. Can we close the SSL
performance gap between Transformers and CNNs? To answer these intriguing
questions, we apply self-supervised pre-training to a recently proposed,
stronger large-kernel CNN architecture and conduct an apples-to-apples comparison
with Transformers on their SSL performance. Our results show that we are able
to build pure CNN SSL architectures that perform on par with or better than the
best SSL-trained Transformers, simply by scaling up convolutional kernel sizes
along with a few other small tweaks. Impressively, when transferring to the downstream
tasks \texttt{MS COCO} detection and segmentation, our SSL pre-trained CNN
model (trained for 100 epochs) matches the performance of its
300-epoch pre-trained Transformer counterpart. We hope this work can help to
better understand what is essential (or not) for self-supervised learning
backbones.
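
To make the recipe above concrete, here is a minimal PyTorch sketch of the two ingredients the abstract names: a residual block built around a very large depthwise convolution, and the negative-cosine objective typical of siamese SSL methods such as SimSiam. The block layout, the 31x31 kernel, and the loss choice are illustrative assumptions, not the paper's released architecture or training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LargeKernelBlock(nn.Module):
    """Residual block built around a very large depthwise convolution.

    The 31x31 kernel follows the spirit of recent large-kernel CNNs
    (e.g., RepLKNet/SLaK); this exact layout is an illustrative
    assumption, not the paper's published architecture.
    """

    def __init__(self, dim: int, kernel_size: int = 31):
        super().__init__()
        # Depthwise convolution keeps parameters manageable even with
        # very large kernels: one kernel_size x kernel_size filter per channel.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        # Pointwise (1x1) convolutions mix channels, inverted-bottleneck style.
        self.pw1 = nn.Conv2d(dim, 4 * dim, 1)
        self.pw2 = nn.Conv2d(4 * dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.norm(self.dwconv(x))
        x = self.pw2(F.gelu(self.pw1(x)))
        return x + residual


def siamese_loss(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity, as in SimSiam-style siamese SSL.

    `p` is the predictor output for one augmented view; `z` is the
    stop-gradient projection of the other view.
    """
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()


if __name__ == "__main__":
    block = LargeKernelBlock(dim=64)
    feats = block(torch.randn(2, 64, 56, 56))
    print(feats.shape)  # torch.Size([2, 64, 56, 56]) -- shape preserved
    p, z = torch.randn(8, 128), torch.randn(8, 128)
    print(siamese_loss(p, z).item())
```

The depthwise design is what makes 31x31 kernels affordable: the parameter cost grows with the kernel area per channel rather than with the product of input and output channel counts.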
Related papers
- OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation [70.17681136234202]
We reexamine the design distinctions and test the limits of what a sparse CNN can achieve.
We propose two key components, i.e., spatially adaptive receptive fields and adaptive relation, to bridge the gap.
This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module.
arXiv Detail & Related papers (2024-03-21T14:06:38Z)
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions [95.94629864981091]
This work presents a new large-scale CNN-based foundation model, termed InternImage, which, like ViTs, can benefit from increasing parameters and training data.
The proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns from massive data at large parameter scales, as ViTs do. (A minimal deformable-convolution sketch appears after this list.)
arXiv Detail & Related papers (2022-11-10T18:59:04Z)
- Can CNNs Be More Robust Than Transformers? [29.615791409258804]
Vision Transformers are shaking the decade-long dominance of Convolutional Neural Networks (CNNs) in image recognition.
Recent research finds that Transformers are inherently more robust than CNNs, regardless of different training setups.
It is believed that such superiority of Transformers should largely be credited to their self-attention-like architectures per se.
arXiv Detail & Related papers (2022-06-07T17:17:07Z)
- Are Transformers More Robust Than CNNs? [17.47001041042089]
We provide the first fair & in-depth comparisons between Transformers and CNNs.
CNNs can easily be as robust as Transformers in defending against adversarial attacks.
Our ablations suggest that such stronger generalization is largely owed to the Transformer's self-attention-like architecture.
arXiv Detail & Related papers (2021-11-10T00:18:59Z)
- Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block [0.0]
Transformer-based architectures are surpassing the state of the art set by CNN architectures in accuracy, but are computationally very expensive to train from scratch.
We study their transfer learning capabilities and compare them with CNNs to understand which architecture is better for real-world problems with small data.
We find that Transformer-based architectures not only achieve higher accuracy than CNNs, but some do so with roughly four times fewer parameters.
arXiv Detail & Related papers (2021-10-11T13:43:03Z)
- A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP [121.35904748477421]
Convolutional neural networks (CNNs) are the dominant deep neural network (DNN) architecture for computer vision.
Transformer- and multi-layer perceptron (MLP)-based models, such as the Vision Transformer and MLP-Mixer, have started to lead new trends.
In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons.
arXiv Detail & Related papers (2021-08-30T06:09:02Z)
- Transformed CNNs: recasting pre-trained convolutional layers with self-attention [17.96659165573821]
Vision Transformers (ViTs) have emerged as a powerful alternative to convolutional networks (CNNs).
In this work, we explore reducing the time spent training self-attention layers by initializing them as convolutional layers.
With only 50 epochs of fine-tuning, the resulting T-CNNs demonstrate significant performance gains.
arXiv Detail & Related papers (2021-06-10T14:56:10Z)
- Container: Context Aggregation Network [83.12004501984043]
Recent findings show that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks relying on larger input image resolutions, our efficient network, CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z)
- Curriculum By Smoothing [52.08553521577014]
Convolutional Neural Networks (CNNs) have shown impressive performance in computer vision tasks such as image classification, detection, and segmentation.
We propose an elegant curriculum-based scheme that smooths the feature embeddings of a CNN using anti-aliasing or low-pass filters (a minimal sketch of this mechanism appears after this list).
As the amount of information in the feature maps increases during training, the network is able to progressively learn better representations of the data.
arXiv Detail & Related papers (2020-03-03T07:27:44Z)
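
The InternImage entry above centers on deformable convolutions. Here is a minimal sketch of that idea using torchvision's DeformConv2d rather than the paper's DCNv3 operator; the offset-prediction layer and block layout are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class SimpleDeformBlock(nn.Module):
    """3x3 deformable convolution with input-dependent sampling offsets.

    Illustrative only: InternImage uses the more elaborate DCNv3
    operator, not torchvision's DeformConv2d.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Two offset values (dy, dx) per kernel position: 2 * 3 * 3 = 18.
        self.offset_pred = nn.Conv2d(channels, 18, 3, padding=1)
        self.deform = DeformConv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_pred(x)  # where each kernel tap should sample
        return self.deform(x, offsets)


if __name__ == "__main__":
    block = SimpleDeformBlock(channels=32)
    print(block(torch.randn(1, 32, 28, 28)).shape)  # torch.Size([1, 32, 28, 28])
```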
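Likewise, the smoothing mechanism in Curriculum By Smoothing fits in a few lines: blur each feature map with a Gaussian low-pass filter and anneal its standard deviation toward zero as training progresses. The kernel size, decay schedule, and module layout below are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def gaussian_kernel2d(sigma: float, size: int = 5) -> torch.Tensor:
    """Normalized 2D Gaussian kernel of shape (size, size)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    kernel = torch.outer(g, g)
    return kernel / kernel.sum()


class SmoothedConv(nn.Module):
    """Convolution whose output is low-pass filtered during training."""

    def __init__(self, in_ch: int, out_ch: int, sigma: float = 1.0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.sigma = sigma

    def step_sigma(self, decay: float = 0.9) -> None:
        """Anneal the blur strength; call once per epoch."""
        self.sigma *= decay

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)
        if self.sigma > 1e-3:  # skip the blur once sigma has decayed away
            k = gaussian_kernel2d(self.sigma).to(x.device)
            # One identical blur filter per channel (depthwise convolution).
            weight = k.view(1, 1, 5, 5).repeat(x.shape[1], 1, 1, 1)
            x = F.conv2d(x, weight, padding=2, groups=x.shape[1])
        return x


if __name__ == "__main__":
    layer = SmoothedConv(3, 16)
    print(layer(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 16, 32, 32])
```

Early in training the blur suppresses high-frequency components of the feature maps; as sigma decays, the network gradually sees its full-bandwidth features, matching the curriculum the entry describes.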