Investigating Transfer Learning Capabilities of Vision Transformers and
CNNs by Fine-Tuning a Single Trainable Block
- URL: http://arxiv.org/abs/2110.05270v1
- Date: Mon, 11 Oct 2021 13:43:03 GMT
- Title: Investigating Transfer Learning Capabilities of Vision Transformers and
CNNs by Fine-Tuning a Single Trainable Block
- Authors: Durvesh Malpure, Onkar Litake, Rajesh Ingle
- Abstract summary: Transformer-based architectures are surpassing the state-of-the-art accuracy set by CNN architectures but are computationally very expensive to train from scratch.
We study their transfer learning capabilities and compare them with CNNs so that we can understand which architecture is better when applied to real-world problems with small data.
We find that transformer-based architectures not only achieve higher accuracy than CNNs, but some transformers do so with around 4 times fewer parameters.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent developments in the field of Computer Vision have seen a rise in the
use of transformer-based architectures. They are surpassing the
state-of-the-art accuracy set by CNN architectures but, on the other hand,
they are computationally very expensive to train from scratch. As these models
are quite recent in the Computer Vision field, there is a need to study their
transfer learning capabilities and compare them with CNNs, so that we can
understand which architecture is better when applied to real-world problems
with small data. In this work, we follow a simple yet restrictive method for
fine-tuning both CNN and Transformer models pretrained on ImageNet1K on
CIFAR-10 and compare them with each other. We unfreeze only the last
transformer/encoder block or the last convolutional block of a model, freeze
all the layers before it, and add a simple MLP at the end for classification.
This simple modification lets us reuse the raw learned weights of both kinds
of networks. From our experiments, we find that transformer-based
architectures not only achieve higher accuracy than CNNs, but some
transformers achieve this feat with around 4 times fewer parameters.
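The fine-tuning recipe described above can be illustrated with a minimal sketch, assuming PyTorch and torchvision; this is not the authors' released code, and the specific backbones (ResNet-50, ViT-B/16), the torchvision weight names, and the MLP width of 512 are illustrative assumptions rather than details taken from the paper.
```python
# Minimal sketch (not the authors' code): freeze an ImageNet-pretrained model,
# unfreeze only its last block, and attach a small MLP head for CIFAR-10.
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # CIFAR-10


def make_head(in_features: int) -> nn.Sequential:
    # Simple MLP classifier appended after the (mostly frozen) backbone.
    # The hidden width of 512 is an illustrative choice, not from the paper.
    return nn.Sequential(
        nn.Linear(in_features, 512),
        nn.ReLU(),
        nn.Linear(512, NUM_CLASSES),
    )


def prepare_resnet() -> nn.Module:
    # CNN case: only the last convolutional block (layer4) and the new head train.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    for p in model.parameters():
        p.requires_grad = False
    for p in model.layer4.parameters():
        p.requires_grad = True
    model.fc = make_head(model.fc.in_features)  # new layers train by default
    return model


def prepare_vit() -> nn.Module:
    # Transformer case: only the last encoder block and the new head train.
    model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
    for p in model.parameters():
        p.requires_grad = False
    for p in model.encoder.layers[-1].parameters():
        p.requires_grad = True
    model.heads = make_head(model.hidden_dim)
    return model
```
With these torchvision weights, CIFAR-10 images would need to be resized to the models' expected input resolution (224x224) before fine-tuning; only the unfrozen block and the new MLP head receive gradient updates.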
Related papers
- OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation [70.17681136234202]
We reexamine the design distinctions and test the limits of what a sparse CNN can achieve.
We propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap.
This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module.
arXiv Detail & Related papers (2024-03-21T14:06:38Z) - The Counterattack of CNNs in Self-Supervised Learning: Larger Kernel
Size might be All You Need [103.31261028244782]
Vision Transformers have been rising rapidly in computer vision thanks to their outstanding scaling trends, and are gradually replacing convolutional neural networks (CNNs).
Recent works on self-supervised learning (SSL) introduce siamese pre-training tasks.
People have come to believe that Transformers or self-attention modules are inherently more suitable than CNNs in the context of SSL.
arXiv Detail & Related papers (2023-12-09T22:23:57Z) - Transfer Learning for Microstructure Segmentation with CS-UNet: A Hybrid
Algorithm with Transformer and CNN Encoders [0.2353157426758003]
We compare the segmentation performance of Transformer and CNN models pre-trained on microscopy images with those pre-trained on natural images.
We also find that for image segmentation, the combination of pre-trained Transformer and CNN encoders is consistently better than pre-trained CNN encoders alone.
arXiv Detail & Related papers (2023-08-26T16:56:15Z) - Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find that Vision Transformers perform poorly in a semi-supervised ImageNet setting.
CNNs achieve superior results in the small labeled-data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z) - A Battle of Network Structures: An Empirical Study of CNN, Transformer,
and MLP [121.35904748477421]
Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision.
Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, have started to lead new trends.
In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons.
arXiv Detail & Related papers (2021-08-30T06:09:02Z) - Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z) - Self-Supervised Learning with Swin Transformers [24.956637957269926]
We present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture.
The approach introduces essentially no new inventions; it is a combination of MoCo v2 and BYOL.
Its performance is slightly better than recent works such as MoCo v3 and DINO, which adopt DeiT as the backbone, but with much lighter tricks.
arXiv Detail & Related papers (2021-05-10T17:59:45Z) - Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with
56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than the CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z) - Escaping the Big Data Paradigm with Compact Transformers [7.697698018200631]
We show for the first time that with the right size and tokenization, transformers can perform head-to-head with state-of-the-art CNNs on small datasets.
Our method is flexible in terms of model size, and can have as few as 0.28M parameters while achieving reasonable results.
arXiv Detail & Related papers (2021-04-12T17:58:56Z) - A Neural Few-Shot Text Classification Reality Check [4.689945062721168]
Several neural few-shot classification models have emerged, yielding significant progress over time.
In this paper, we compare all these models, first adapting those made in the field of image processing to NLP, and second providing them access to transformers.
We then test these models equipped with the same transformer-based encoder on the intent detection task, known for having a large number of classes.
arXiv Detail & Related papers (2021-01-28T15:46:14Z)