Is it Time to Replace CNNs with Transformers for Medical Images?
- URL: http://arxiv.org/abs/2108.09038v1
- Date: Fri, 20 Aug 2021 08:01:19 GMT
- Title: Is it Time to Replace CNNs with Transformers for Medical Images?
- Authors: Christos Matsoukas, Johan Fredin Haslum, Magnus Söderberg and Kevin Smith
- Abstract summary: Vision transformers (ViTs) have appeared as a competitive alternative to CNNs.
We consider these questions in a series of experiments on three mainstream medical image datasets.
- Score: 2.216181561365727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional Neural Networks (CNNs) have reigned for a decade as the de
facto approach to automated medical image diagnosis. Recently, vision
transformers (ViTs) have appeared as a competitive alternative to CNNs,
yielding similar levels of performance while possessing several interesting
properties that could prove beneficial for medical imaging tasks. In this work,
we explore whether it is time to move to transformer-based models or if we
should keep working with CNNs - can we trivially switch to transformers? If so,
what are the advantages and drawbacks of switching to ViTs for medical image
diagnosis? We consider these questions in a series of experiments on three
mainstream medical image datasets. Our findings show that, while CNNs perform
better when trained from scratch, off-the-shelf vision transformers using
default hyperparameters are on par with CNNs when pretrained on ImageNet, and
outperform their CNN counterparts when pretrained using self-supervision.
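The comparison described in the abstract is straightforward to reproduce in outline: take an ImageNet-pretrained ViT and an ImageNet-pretrained CNN, swap in a new classification head, and fine-tune both with the same default recipe on the medical dataset. The sketch below illustrates that setup; the specific backbones (torchvision's ViT-B/16 and ResNet-50), the hyperparameters, and the placeholder data loader are assumptions for illustration, not the authors' exact configuration.

```python
# A minimal sketch, not the authors' released code: fine-tune an
# ImageNet-pretrained ViT and CNN with the same default recipe on a
# medical image classification dataset. Backbones (ViT-B/16, ResNet-50),
# hyperparameters, and the data loader are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 2  # placeholder, e.g. benign vs. malignant

def build_model(arch: str) -> nn.Module:
    """Load an ImageNet-pretrained backbone and swap in a new head."""
    if arch == "vit":
        m = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
        m.heads.head = nn.Linear(m.heads.head.in_features, NUM_CLASSES)
    else:
        m = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        m.fc = nn.Linear(m.fc.in_features, NUM_CLASSES)
    return m

def finetune(model, loader, epochs=10, lr=1e-4, device="cuda"):
    """Identical training loop for both architectures to keep the comparison fair."""
    model.to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
    return model

# Usage (train_loader is any DataLoader of preprocessed medical images):
# vit = finetune(build_model("vit"), train_loader)
# cnn = finetune(build_model("cnn"), train_loader)
```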
Related papers
- A Hybrid Fully Convolutional CNN-Transformer Model for Inherently Interpretable Medical Image Classification [5.904095466127043]
We introduce an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture for medical image classification.
Our model achieves state-of-the-art predictive performance compared to both black-box and interpretable models.
arXiv Detail & Related papers (2025-04-11T12:15:22Z)
- OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation [70.17681136234202]
We reexamine the design distinctions and test the limits of what a sparse CNN can achieve.
We propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap.
This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module.
arXiv Detail & Related papers (2024-03-21T14:06:38Z)
- The Counterattack of CNNs in Self-Supervised Learning: Larger Kernel Size might be All You Need [103.31261028244782]
Vision Transformers have been rapidly rising in computer vision thanks to their outstanding scaling trends, and are gradually replacing convolutional neural networks (CNNs).
Recent works on self-supervised learning (SSL) introduce siamese pre-training tasks.
People have come to believe that Transformers or self-attention modules are inherently more suitable than CNNs in the context of SSL.
arXiv Detail & Related papers (2023-12-09T22:23:57Z)
- MobileUtr: Revisiting the relationship between light-weight CNN and Transformer for efficient medical image segmentation [25.056401513163493]
This work revisits the relationship between CNNs and Transformers in lightweight universal networks for medical image segmentation.
In order to leverage the inductive bias inherent in CNNs, we abstract a Transformer-like lightweight CNN block (ConvUtr) as the patch embeddings of ViTs (a rough sketch of this idea appears after the related papers list).
We build an efficient medical image segmentation model (MobileUtr) based on CNN and Transformer.
arXiv Detail & Related papers (2023-12-04T09:04:05Z)
- Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing [64.7892681641764]
We train vision transformers (ViTs) and convolutional neural networks (CNNs) using Patch Mixing.
We find that ViTs neither improve nor degrade when trained using Patch Mixing.
We conclude that this training method is a way of simulating in CNNs the abilities that ViTs already possess.
arXiv Detail & Related papers (2023-06-30T17:59:53Z)
- Pretrained ViTs Yield Versatile Representations For Medical Images [4.443013185089128]
Vision transformers (ViTs) have appeared as a competitive alternative to CNNs.
We conduct a series of experiments on several standard 2D medical image benchmark datasets and tasks.
Our findings show that, while CNNs perform better if trained from scratch, off-the-shelf vision transformers can perform on par with CNNs when pretrained on ImageNet.
arXiv Detail & Related papers (2023-03-13T11:53:40Z)
- Data-Efficient Vision Transformers for Multi-Label Disease Classification on Chest Radiographs [55.78588835407174]
Vision Transformers (ViTs) have not been applied to this task despite their high classification performance on generic images.
ViTs do not rely on convolutions but on patch-based self-attention, and, in contrast to CNNs, no prior knowledge of local connectivity is present.
Our results show that while the performance between ViTs and CNNs is on par with a small benefit for ViTs, DeiTs outperform the former if a reasonably large data set is available for training.
arXiv Detail & Related papers (2022-08-17T09:07:45Z)
- Transformers in Medical Imaging: A Survey [88.03790310594533]
Transformers have been successfully applied to several computer vision problems, achieving state-of-the-art results.
Medical imaging has also witnessed growing interest in Transformers, which can capture global context in contrast to CNNs with local receptive fields.
We provide a review of the applications of Transformers in medical imaging covering various aspects, ranging from recently proposed architectural designs to unsolved issues.
arXiv Detail & Related papers (2022-01-24T18:50:18Z)
- Semi-Supervised Medical Image Segmentation via Cross Teaching between CNN and Transformer [11.381487613753004]
We present a framework for semi-supervised medical image segmentation by introducing the cross teaching between CNN and Transformer.
Notably, this work may be the first attempt to combine CNN and transformer for semi-supervised medical image segmentation and achieve promising results on a public benchmark.
arXiv Detail & Related papers (2021-12-09T13:22:38Z)
- Transformed CNNs: recasting pre-trained convolutional layers with self-attention [17.96659165573821]
Vision Transformers (ViTs) have emerged as a powerful alternative to convolutional networks (CNNs).
In this work, we explore the idea of reducing the time spent training the self-attention layers by initializing them as convolutional layers.
With only 50 epochs of fine-tuning, the resulting T-CNNs demonstrate significant performance gains.
arXiv Detail & Related papers (2021-06-10T14:56:10Z)
- Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than the CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z)
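Returning to the MobileUtr entry above: the idea of replacing a ViT's linear patch embedding with a lightweight convolutional block can be sketched in a few lines of PyTorch. The sketch below is an assumption-laden illustration of the general pattern, not the authors' ConvUtr implementation.

```python
# A minimal sketch of the general idea, not the published ConvUtr block:
# a lightweight convolutional stem produces the patch tokens consumed by a
# standard Transformer encoder, so the tokenizer carries a CNN's locality
# bias. Layer widths/depths, the classification head (MobileUtr targets
# segmentation), and the omission of positional embeddings are simplifications.
import torch
import torch.nn as nn

class ConvPatchEmbed(nn.Module):
    """Convolutional stem mapping an image to a grid of patch tokens."""
    def __init__(self, in_ch=3, embed_dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.GELU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.GELU(),
            nn.Conv2d(128, embed_dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):
        x = self.stem(x)                     # (B, C, H/8, W/8)
        return x.flatten(2).transpose(1, 2)  # (B, num_tokens, C)

class ConvStemViT(nn.Module):
    """Transformer encoder fed by convolutional patch embeddings."""
    def __init__(self, embed_dim=256, depth=4, heads=8, num_classes=2):
        super().__init__()
        self.embed = ConvPatchEmbed(embed_dim=embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.encoder(self.embed(x))
        return self.head(tokens.mean(dim=1))  # mean-pool tokens, then classify

# A 224x224 input yields a 28x28 token grid (784 tokens):
# logits = ConvStemViT()(torch.randn(1, 3, 224, 224))
```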
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.