Delving into Masked Autoencoders for Multi-Label Thorax Disease Classification
- URL: http://arxiv.org/abs/2210.12843v1
- Date: Sun, 23 Oct 2022 20:14:57 GMT
- Title: Delving into Masked Autoencoders for Multi-Label Thorax Disease Classification
- Authors: Junfei Xiao, Yutong Bai, Alan Yuille and Zongwei Zhou
- Abstract summary: Vision Transformer (ViT) has shown inferior performance to Convolutional Neural Network (CNN) on medical tasks due to its data-hungry nature and the lack of annotated medical data.
In this paper, we pre-train ViTs on 266,340 chest X-rays using Masked Autoencoders (MAE) which reconstruct missing pixels from a small part of each image.
The results show that our pre-trained ViT performs comparably (sometimes better) to the state-of-the-art CNN (DenseNet-121) for multi-label thorax disease classification.
- Score: 16.635426201975587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) has become one of the most popular neural
architectures due to its great scalability, computational efficiency, and
compelling performance in many vision tasks. However, ViT has shown inferior
performance to Convolutional Neural Network (CNN) on medical tasks due to its
data-hungry nature and the lack of annotated medical data. In this paper, we
pre-train ViTs on 266,340 chest X-rays using Masked Autoencoders (MAE) which
reconstruct missing pixels from a small part of each image. For comparison,
CNNs are also pre-trained on the same 266,340 X-rays using advanced
self-supervised methods (e.g., MoCo v2). The results show that our pre-trained
ViT performs comparably (sometimes better) to the state-of-the-art CNN
(DenseNet-121) for multi-label thorax disease classification. This performance
is attributed to the strong recipes extracted from our empirical studies for
pre-training and fine-tuning ViT. The pre-training recipe indicates that
medical reconstruction works best with a much smaller visible proportion of the
image (10% vs. 25%) and a more moderate random-resized-crop scale range
(0.5~1.0 vs. 0.2~1.0) than in natural imaging. Furthermore, we note that
in-domain transfer learning is preferred whenever possible. The fine-tuning
recipe shows that layer-wise LR decay, RandAug magnitude, and DropPath rate are significant
factors to consider. We hope that this study can direct future research on the
application of Transformers to a larger variety of medical imaging tasks.
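The following is a minimal sketch, not the authors' released code, of what the recipe above amounts to in practice: MAE-style random patch masking that keeps roughly 10% of patches visible (mask ratio 0.9 instead of the 0.75 typical for natural images), a RandomResizedCrop scale of 0.5~1.0 for pre-training, and a fine-tuning configuration naming the three knobs the abstract highlights. The fine-tuning numbers are placeholder assumptions, since the abstract does not report the exact settings.

```python
import torch
from torchvision import transforms

# Pre-training augmentation for chest X-rays: a moderate crop scale (0.5-1.0)
# rather than the 0.2-1.0 commonly used for natural images.
pretrain_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def random_masking(patch_tokens: torch.Tensor, mask_ratio: float = 0.9):
    """MAE-style random masking: keep only (1 - mask_ratio) of the patch tokens.

    patch_tokens: (batch, num_patches, embed_dim) patch embeddings.
    Returns the visible tokens, a binary mask (1 = masked), and the indices
    needed to restore the original patch order for the decoder.
    """
    b, n, d = patch_tokens.shape
    n_keep = int(n * (1.0 - mask_ratio))          # e.g. ~10% of patches stay visible

    noise = torch.rand(b, n, device=patch_tokens.device)
    ids_shuffle = torch.argsort(noise, dim=1)     # random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    mask = torch.ones(b, n, device=patch_tokens.device)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)     # back to original patch order
    return visible, mask, ids_restore

# Fine-tuning knobs named in the abstract; the numbers here are illustrative
# placeholders, not the values reported in the paper.
finetune_recipe = dict(
    layer_wise_lr_decay=0.65,
    randaug_magnitude=6,
    drop_path_rate=0.2,
)
```

The masking function follows the standard MAE shuffle-and-keep scheme; under the recipe above, only the keep ratio and crop range change for chest X-rays.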
Related papers
- Self-supervised learning improves robustness of deep learning lung tumor segmentation to CT imaging differences [7.332652485849634]
Self-supervised learning (SSL) is an approach to extract useful feature representations from unlabeled data.
We compare the robustness of wild-pretrained versus self-pretrained vision transformer (ViT) and hierarchical shifted-window (Swin) models to computed tomography (CT) imaging differences.
Wild-pretrained networks were more robust to the analyzed CT imaging differences for lung tumor segmentation than self-pretrained models.
arXiv Detail & Related papers (2024-05-14T14:35:21Z)
- MoVL: Exploring Fusion Strategies for the Domain-Adaptive Application of Pretrained Models in Medical Imaging Tasks [6.8948885302235325]
We introduce visual prompting (VP) to fill the gap between input medical images and natural-image-pretrained vision models.
We design a joint learning loss function containing a categorisation loss and a discrepancy loss, which describes the variance between prompted and plain images.
On an out-of-distribution medical dataset, our method (90.33%) outperforms FF (85.15%) by an absolute 5.18%.
arXiv Detail & Related papers (2024-05-13T01:18:25Z)
- Performance of GAN-based augmentation for deep learning COVID-19 image classification [57.1795052451257]
The biggest challenge in the application of deep learning to the medical domain is the availability of training data.
Data augmentation is a typical methodology used in machine learning when confronted with a limited data set.
In this work, a StyleGAN2-ADA model of Generative Adversarial Networks is trained on the limited COVID-19 chest X-ray image set.
arXiv Detail & Related papers (2023-04-18T15:39:58Z)
- Pretrained ViTs Yield Versatile Representations For Medical Images [4.443013185089128]
Vision transformers (ViTs) have appeared as a competitive alternative to CNNs.
We conduct a series of experiments on several standard 2D medical image benchmark datasets and tasks.
Our findings show that, while CNNs perform better if trained from scratch, off-the-shelf vision transformers can perform on par with CNNs when pretrained on ImageNet.
arXiv Detail & Related papers (2023-03-13T11:53:40Z)
- AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-pixel Images [53.29794593104923]
We present a novel concept of shared-context processing for whole slide histopathology images.
AMIGO uses the cellular graph within the tissue to provide a single representation for a patient.
We show that our model is strongly robust to missing information to an extent that it can achieve the same performance with as low as 20% of the data.
arXiv Detail & Related papers (2023-03-01T23:37:45Z)
- Data-Efficient Vision Transformers for Multi-Label Disease Classification on Chest Radiographs [55.78588835407174]
Vision Transformers (ViTs) have not been applied to this task despite their high classification performance on generic images.
ViTs do not rely on convolutions but on patch-based self-attention and in contrast to CNNs, no prior knowledge of local connectivity is present.
Our results show that while the performance between ViTs and CNNs is on par with a small benefit for ViTs, DeiTs outperform the former if a reasonably large data set is available for training.
arXiv Detail & Related papers (2022-08-17T09:07:45Z)
- Learning from few examples: Classifying sex from retinal images via deep learning [3.9146761527401424]
We showcase results for the performance of DL on small datasets to classify patient sex from fundus images.
Our models, developed using approximately 2500 fundus images, achieved test AUC scores of up to 0.72.
This corresponds to a mere 25% decrease in performance despite a nearly 1000-fold decrease in the dataset size.
arXiv Detail & Related papers (2022-07-20T02:47:29Z)
- Self-supervised 3D anatomy segmentation using self-distilled masked image transformer (SMIT) [2.7298989068857487]
Self-supervised learning has demonstrated success in medical image segmentation using convolutional networks.
We show our approach is more accurate and requires fewer fine tuning datasets than other pretext tasks.
arXiv Detail & Related papers (2022-05-20T17:55:14Z)
- Corrupted Image Modeling for Self-Supervised Visual Pre-Training [103.99311611776697]
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial mask tokens.
After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks.
arXiv Detail & Related papers (2022-02-07T17:59:04Z)
- Vision Transformers for femur fracture classification [59.99241204074268]
The Vision Transformer (ViT) was able to correctly predict 83% of the test images.
Good results were also obtained on sub-fractures using the largest and richest dataset of its kind.
arXiv Detail & Related papers (2021-08-07T10:12:42Z)
- Classification of COVID-19 in CT Scans using Multi-Source Transfer Learning [91.3755431537592]
We propose the use of Multi-Source Transfer Learning to improve upon traditional Transfer Learning for the classification of COVID-19 from CT scans.
With our multi-source fine-tuning approach, our models outperformed baseline models fine-tuned with ImageNet.
Our best performing model was able to achieve an accuracy of 0.893 and a Recall score of 0.897, outperforming its baseline Recall score by 9.3%.
arXiv Detail & Related papers (2020-09-22T11:53:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.