Pretrained ViTs Yield Versatile Representations For Medical Images
- URL: http://arxiv.org/abs/2303.07034v3
- Date: Fri, 15 Nov 2024 15:31:44 GMT
- Title: Pretrained ViTs Yield Versatile Representations For Medical Images
- Authors: Christos Matsoukas, Johan Fredin Haslum, Moein Sorkhei, Magnus Söderberg, Kevin Smith
- Abstract summary: Vision transformers (ViTs) have emerged as a competitive alternative to CNNs.
We conduct a series of experiments on several standard 2D medical image benchmark datasets and tasks.
Our findings show that, while CNNs perform better if trained from scratch, off-the-shelf vision transformers can perform on par with CNNs when pretrained on ImageNet.
- Score: 4.443013185089128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional Neural Networks (CNNs) have reigned for a decade as the de facto approach to automated medical image diagnosis, pushing the state-of-the-art in classification, detection and segmentation tasks. In recent years, vision transformers (ViTs) have emerged as a competitive alternative to CNNs, yielding impressive levels of performance in the natural image domain, while possessing several interesting properties that could prove beneficial for medical imaging tasks. In this work, we explore the benefits and drawbacks of transformer-based models for medical image classification. We conduct a series of experiments on several standard 2D medical image benchmark datasets and tasks. Our findings show that, while CNNs perform better if trained from scratch, off-the-shelf vision transformers can perform on par with CNNs when pretrained on ImageNet, both in a supervised and a self-supervised setting, rendering them a viable alternative to CNNs.
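The recipe the abstract evaluates is essentially transfer learning: load an ImageNet-pretrained ViT off the shelf and fine-tune it on the target medical dataset. A minimal sketch follows, assuming the timm library; the checkpoint name, learning rate, and 2-class head are illustrative placeholders, not the authors' exact configuration.

```python
# A minimal transfer-learning sketch, assuming the timm library.
# Model name, learning rate, and the 2-class head are illustrative
# placeholders, not the paper's exact setup.
import timm
import torch
from torch import nn

# Off-the-shelf ViT with ImageNet-pretrained weights; the classification
# head is re-initialized for the target medical task.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One supervised fine-tuning step on a batch of medical images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```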
Related papers
- A Hybrid Fully Convolutional CNN-Transformer Model for Inherently Interpretable Medical Image Classification [5.904095466127043]
We introduce an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture for medical image classification.
Our model achieves state-of-the-art predictive performance compared to both black-box and interpretable models.
arXiv Detail & Related papers (2025-04-11T12:15:22Z)
- MedSegDiff-V2: Diffusion based Medical Image Segmentation with Transformer [53.575573940055335]
We propose a novel Transformer-based Diffusion framework, called MedSegDiff-V2.
We verify its effectiveness on 20 medical image segmentation tasks with different image modalities.
arXiv Detail & Related papers (2023-01-19T03:42:36Z)
- A New Perspective to Boost Vision Transformer for Medical Image Classification [33.215289791017064]
We propose a self-supervised learning approach specifically for medical image classification with the Transformer backbone.
Our BOLT consists of two networks, namely online and target branches, for self-supervised representation learning.
The experimental results validate the superiority of our BOLT for medical image classification, compared to ImageNet pretrained weights and state-of-the-art self-supervised learning approaches.
arXiv Detail & Related papers (2023-01-03T07:45:59Z)
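The BOLT entry above describes a two-branch self-supervised setup. A minimal sketch of the generic online/target pattern it names (as popularized by BYOL) follows; BOLT's actual losses, augmentations, and Transformer backbone are described in the paper, and the small MLP here is a stand-in.

```python
# A generic online/target two-branch sketch (BYOL-style); the MLP encoder
# below is a stand-in, not BOLT's actual Transformer backbone.
import copy
import torch
from torch import nn

online = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 128))
target = copy.deepcopy(online)
for p in target.parameters():
    p.requires_grad = False  # the target branch is never trained directly

@torch.no_grad()
def update_target(momentum: float = 0.99) -> None:
    """EMA update: target <- m * target + (1 - m) * online."""
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(momentum).add_(p_o, alpha=1.0 - momentum)

# Two augmented views of the same image; the online branch learns to
# match the target branch's representation of the other view.
view_a, view_b = torch.randn(8, 768), torch.randn(8, 768)
loss = nn.functional.mse_loss(online(view_a), target(view_b))
```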
- Delving into Masked Autoencoders for Multi-Label Thorax Disease Classification [16.635426201975587]
Vision Transformer (ViT) has shown inferior performance to Convolutional Neural Network (CNN) on medical tasks due to its data-hungry nature and the lack of annotated medical data.
In this paper, we pre-train ViTs on 266,340 chest X-rays using Masked Autoencoders (MAE) which reconstruct missing pixels from a small part of each image.
The results show that our pre-trained ViT performs comparably (sometimes better) to the state-of-the-art CNN (DenseNet-121) for multi-label thorax disease classification.
arXiv Detail & Related papers (2022-10-23T20:14:57Z)
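The entry above hinges on MAE's core mechanism: hide most patches and reconstruct the missing pixels from the small visible part. A minimal sketch of the random-masking step follows; the 75% mask ratio matches the original MAE recipe, while shapes and the usage example are illustrative.

```python
# A minimal sketch of MAE-style random masking: keep a small visible
# subset of patch tokens and let the model reconstruct the rest.
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (batch, num_patches, dim) patch embeddings.

    Returns the visible tokens plus the permutations needed to restore
    the original patch order after decoding.
    """
    batch, num_patches, dim = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    noise = torch.rand(batch, num_patches)   # one random score per patch
    shuffle = noise.argsort(dim=1)           # random permutation of patches
    restore = shuffle.argsort(dim=1)         # inverse permutation
    keep = shuffle[:, :num_keep]             # indices of the visible patches

    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, dim))
    return visible, keep, restore

# e.g. 196 patches from a 224x224 X-ray; only 49 are shown to the encoder.
visible, keep, restore = random_masking(torch.randn(2, 196, 768))
```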
- Data-Efficient Vision Transformers for Multi-Label Disease Classification on Chest Radiographs [55.78588835407174]
Vision Transformers (ViTs) have not been applied to this task despite their high classification performance on generic images.
ViTs do not rely on convolutions but on patch-based self-attention and in contrast to CNNs, no prior knowledge of local connectivity is present.
Our results show that, while ViTs and CNNs perform on par, with a small benefit for ViTs, DeiTs outperform the former if a reasonably large data set is available for training.
arXiv Detail & Related papers (2022-08-17T09:07:45Z)
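The Data-Efficient ViTs entry above rests on the contrast it draws: ViTs replace convolutional locality with patch-based self-attention, so no prior knowledge of local connectivity is built in. A minimal sketch, with illustrative sizes, of how an image becomes patch tokens that attend globally:

```python
# A minimal sketch of patch-based self-attention: cut the image into
# fixed-size patches, embed each linearly, and let attention mix
# information across all patches with no built-in locality prior.
import torch
from torch import nn

patch, dim, heads = 16, 768, 12
to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patch embedding
attention = nn.MultiheadAttention(dim, heads, batch_first=True)

image = torch.randn(1, 3, 224, 224)
tokens = to_tokens(image).flatten(2).transpose(1, 2)  # (1, 196, 768) patch tokens
mixed, _ = attention(tokens, tokens, tokens)          # global, not local, mixing
```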
- Prune and distill: similar reformatting of image information along rat visual cortex and deep neural networks [61.60177890353585]
Deep convolutional neural networks (CNNs) have been shown to provide excellent models for their functional analogue in the brain, the ventral stream in visual cortex.
Here we consider some prominent statistical patterns that are known to exist in the internal representations of either CNNs or the visual cortex.
We show that CNNs and visual cortex share a similarly tight relationship between dimensionality expansion/reduction of object representations and reformatting of image information.
arXiv Detail & Related papers (2022-05-27T08:06:40Z)
- Corrupted Image Modeling for Self-Supervised Visual Pre-Training [103.99311611776697]
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial mask tokens.
After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks.
arXiv Detail & Related papers (2022-02-07T17:59:04Z)
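The CIM entry describes a generator/enhancer pair: a small generator produces a corrupted image, and the enhancer learns by working out what was corrupted. The sketch below shows only the enhancer side with a replaced-patch detection objective, using stand-in modules; CIM's actual generator (a small trainable BEiT) and full training losses are in the paper.

```python
# A sketch of CIM's enhancer side under stated assumptions: the Linear
# backbone stands in for a real visual encoder, and only a per-patch
# replaced/not-replaced detection objective is shown.
import torch
from torch import nn

class Enhancer(nn.Module):
    """Stand-in encoder trained to spot generator-replaced patches."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)  # placeholder for a ViT/CNN
        self.head = nn.Linear(dim, 1)        # per-patch "was replaced?" logit

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(patch_tokens)).squeeze(-1)

def detection_loss(enhancer: Enhancer,
                   corrupted_tokens: torch.Tensor,
                   replaced: torch.Tensor) -> torch.Tensor:
    """Binary objective: predict, per patch, whether the generator replaced it."""
    logits = enhancer(corrupted_tokens)  # (batch, num_patches)
    return nn.functional.binary_cross_entropy_with_logits(logits, replaced.float())
```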
- Transformers in Medical Imaging: A Survey [88.03790310594533]
Transformers have been successfully applied to several computer vision problems, achieving state-of-the-art results.
Medical imaging has also witnessed growing interest in Transformers, which can capture global context, in contrast to CNNs with their local receptive fields.
We provide a review of the applications of Transformers in medical imaging covering various aspects, ranging from recently proposed architectural designs to unsolved issues.
arXiv Detail & Related papers (2022-01-24T18:50:18Z)
- Transformer-Unet: Raw Image Processing with Unet [4.7944896477309555]
We propose Transformer-Unet, which applies transformer modules to raw images instead of to feature maps within Unet.
We form an end-to-end network and, in our experiments, achieve segmentation results better than many previous Unet-based algorithms.
arXiv Detail & Related papers (2021-09-17T09:03:10Z)
- Is it Time to Replace CNNs with Transformers for Medical Images? [2.216181561365727]
Vision transformers (ViTs) have emerged as a competitive alternative to CNNs.
We consider these questions in a series of experiments on three mainstream medical image datasets.
arXiv Detail & Related papers (2021-08-20T08:01:19Z)
- Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is an Unet-like pure Transformer for medical image segmentation.
Tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z)
- TransMed: Transformers Advance Multi-modal Medical Image Classification [4.500880052705654]
Convolutional neural networks (CNNs) have shown very competitive performance in medical image analysis tasks.
Transformers have been applied to computer vision and achieved remarkable success in large-scale datasets.
TransMed combines the advantages of CNN and transformer to efficiently extract low-level features of images.
arXiv Detail & Related papers (2021-03-10T08:57:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.