Transformed CNNs: recasting pre-trained convolutional layers with self-attention
- URL: http://arxiv.org/abs/2106.05795v1
- Date: Thu, 10 Jun 2021 14:56:10 GMT
- Title: Transformed CNNs: recasting pre-trained convolutional layers with self-attention
- Authors: Stéphane d'Ascoli, Levent Sagun, Giulio Biroli, Ari Morcos
- Abstract summary: Vision Transformers (ViT) have emerged as a powerful alternative to convolutional networks (CNNs).
In this work, we explore the idea of reducing the time spent training these layers by initializing them as convolutional layers.
With only 50 epochs of fine-tuning, the resulting T-CNNs demonstrate significant performance gains.
- Score: 17.96659165573821
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViT) have recently emerged as a powerful alternative to
convolutional networks (CNNs). Although hybrid models attempt to bridge the gap
between these two architectures, the self-attention layers they rely on induce
a strong computational bottleneck, especially at large spatial resolutions. In
this work, we explore the idea of reducing the time spent training these layers
by initializing them as convolutional layers. This enables us to transition
smoothly from any pre-trained CNN to its functionally identical hybrid model,
called Transformed CNN (T-CNN). With only 50 epochs of fine-tuning, the
resulting T-CNNs demonstrate significant performance gains over the CNN (+2.2%
top-1 on ImageNet-1k for a ResNet50-RS) as well as substantially improved
robustness (+11% top-1 on ImageNet-C). We analyze the representations learnt by
the T-CNN, providing deeper insights into the fruitful interplay between
convolutions and self-attention. Finally, we experiment initializing the T-CNN
from a partially trained CNN, and find that it reaches better performance than
the corresponding hybrid model trained from scratch, while reducing training
time.
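The core trick of the paper is initializing self-attention so that it initially computes the same function as a convolution. As a rough illustration of why this is possible, the sketch below builds a 1D self-attention layer whose attention scores are purely positional, with one head per kernel tap; all shapes, variable names, and the temperature parameter `alpha` are hypothetical choices for this sketch, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 10, 4                      # sequence length, channel dimension
x = rng.normal(size=(L, d))

# One head per kernel tap: head h attends to relative offset delta_h,
# and its value projection W[h] plays the role of that tap's conv weights.
offsets = [-1, 0, 1]
W = rng.normal(size=(len(offsets), d, d))

alpha = 20.0                      # sharpness: large alpha -> near one-hot attention
out = np.zeros((L, d))
for h, delta in enumerate(offsets):
    # Purely positional attention logits, peaked at (query position + delta)
    q = np.arange(L)[:, None]     # query index i
    k = np.arange(L)[None, :]     # key index j
    logits = -alpha * (k - (q + delta)) ** 2
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    out += A @ x @ W[h]           # this head's attention output

# Reference: a direct 1D convolution with the same taps (zero padding)
ref = np.zeros((L, d))
for i in range(L):
    for h, delta in enumerate(offsets):
        j = i + delta
        if 0 <= j < L:
            ref[i] += x[j] @ W[h]

# Away from the borders the attention layer matches the convolution,
# because each head's softmax is almost a one-hot shift operator.
print(np.abs(out[1:-1] - ref[1:-1]).max())
```

Since the positional logits dominate at initialization, the layer starts as a convolution and fine-tuning can gradually let content-based attention take over.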
Related papers
- Model Parallel Training and Transfer Learning for Convolutional Neural Networks by Domain Decomposition [0.0]
Deep convolutional neural networks (CNNs) have been shown to be very successful in a wide range of image processing applications.
Due to their increasing number of model parameters and an increasing availability of large amounts of training data, parallelization strategies to efficiently train complex CNNs are necessary.
arXiv Detail & Related papers (2024-08-26T17:35:01Z)
- CNN2GNN: How to Bridge CNN with GNN [59.42117676779735]
We propose a novel CNN2GNN framework to unify CNN and GNN together via distillation.
The performance of the distilled 'boosted' two-layer GNN on Mini-ImageNet is much higher than that of CNNs containing dozens of layers, such as ResNet152.
arXiv Detail & Related papers (2024-04-23T08:19:08Z)
- OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation [70.17681136234202]
We reexamine the design distinctions and test the limits of what a sparse CNN can achieve.
We propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap.
This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module.
arXiv Detail & Related papers (2024-03-21T14:06:38Z)
- The Counterattack of CNNs in Self-Supervised Learning: Larger Kernel Size might be All You Need [103.31261028244782]
Vision Transformers have been rapidly gaining ground in computer vision thanks to their outstanding scaling trends, and are gradually replacing convolutional neural networks (CNNs).
Recent works on self-supervised learning (SSL) introduce siamese pre-training tasks.
This has led many to believe that Transformers or self-attention modules are inherently more suitable than CNNs in the context of SSL.
arXiv Detail & Related papers (2023-12-09T22:23:57Z)
- An automated approach for improving the inference latency and energy efficiency of pretrained CNNs by removing irrelevant pixels with focused convolutions [0.8706730566331037]
We propose a novel, automated method to make a pretrained CNN more energy-efficient without re-training.
Our modified focused convolution operation saves inference latency (by up to 25%) and energy costs (by up to 22%) on various popular pretrained CNNs.
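The paper's exact focused convolution operator is not reproduced here, but the underlying idea, spending convolution work only on relevant pixels, can be pictured with a toy sketch. The relevance mask below is a made-up example; how relevant regions are actually identified is the paper's contribution:

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))

# Hypothetical relevance mask: only the central region is "interesting"
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True

padded = np.pad(img, 1)           # zero padding for a 3x3 kernel
out = np.zeros_like(img)
# Convolve only at masked positions; irrelevant pixels cost nothing
for i, j in zip(*np.nonzero(mask)):
    out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
```

Here 16 of 64 positions are computed, so roughly 75% of the multiply-accumulate work is skipped; the reported latency and energy savings come from this kind of skipping inside pretrained networks.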
arXiv Detail & Related papers (2023-10-11T18:07:37Z)
- Transferability of Convolutional Neural Networks in Stationary Learning Tasks [96.00428692404354]
We introduce a novel framework for efficient training of convolutional neural networks (CNNs) for large-scale spatial problems.
We show that a CNN trained on small windows of such signals achieves nearly identical performance on much larger windows without retraining.
Our results show that the CNN is able to tackle problems with many hundreds of agents after being trained with fewer than ten.
arXiv Detail & Related papers (2023-07-21T13:51:45Z)
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions [95.94629864981091]
This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs.
The proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs.
arXiv Detail & Related papers (2022-11-10T18:59:04Z)
- Patching Weak Convolutional Neural Network Models through Modularization and Composition [19.986199290508925]
A convolutional neural network (CNN) model for classification tasks often performs unsatisfactorily.
We propose a compressed modularization approach, CNNSplitter, which decomposes a strong CNN model for $N$-class classification into $N$ smaller CNN modules.
We show that CNNSplitter can patch a weak CNN model through modularization and composition, thus providing a new solution for developing robust CNN models.
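The composition step can be pictured as routing an input through the $N$ per-class modules and taking the most confident one. In the toy sketch below the modules are stand-in scoring functions (the real CNNSplitter modules are small sub-CNNs, and this scorer is purely hypothetical):

```python
import numpy as np

# Toy stand-ins for the N per-class modules: each returns a confidence
# that the input belongs to its class (real modules are small CNNs).
def make_module(k):
    return lambda x: float(-abs(x.mean() - k))   # hypothetical scorer

n_classes = 3
modules = [make_module(k) for k in range(n_classes)]

def compose(x):
    """Composed prediction: the class whose module is most confident."""
    scores = [m(x) for m in modules]
    return int(np.argmax(scores))

x = np.full((4, 4), 1.9)          # an input whose mean is near class 2
print(compose(x))                  # -> 2
```

Patching a weak model then amounts to replacing (or adding) the module for a poorly-handled class while leaving the others untouched.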
arXiv Detail & Related papers (2022-09-11T15:26:16Z)
- Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement [53.47564132861866]
We find that a hybrid architecture, namely CNN-TT, is capable of maintaining a good quality performance with a reduced model parameter size.
CNN-TT is composed of several convolutional layers at the bottom for feature extraction to improve speech quality.
arXiv Detail & Related papers (2020-07-25T22:21:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.