A Comparative Study of CNN, ResNet, and Vision Transformers for Multi-Classification of Chest Diseases
- URL: http://arxiv.org/abs/2406.00237v1
- Date: Fri, 31 May 2024 23:56:42 GMT
- Title: A Comparative Study of CNN, ResNet, and Vision Transformers for Multi-Classification of Chest Diseases
- Authors: Ananya Jain, Aviral Bhardwaj, Kaushik Murali, Isha Surani,
- Abstract summary: Vision Transformers (ViT) are powerful tools due to their scalability and ability to process large amounts of data.
We fine-tuned two variants of ViT models, one pre-trained on ImageNet and another trained from scratch, using the NIH Chest X-ray dataset.
Our study evaluates the performance of these models in the multi-label classification of 14 distinct diseases.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models, notably utilizing Transformer architectures, have emerged as powerful tools due to their scalability and ability to process large amounts of data. Dosovitskiy et al. expanded this architecture to introduce Vision Transformers (ViT), extending its applicability to image processing tasks. Motivated by this advancement, we fine-tuned two variants of ViT models, one pre-trained on ImageNet and another trained from scratch, using the NIH Chest X-ray dataset containing over 100,000 frontal-view X-ray images. Our study evaluates the performance of these models in the multi-label classification of 14 distinct diseases, while using Convolutional Neural Networks (CNNs) and ResNet architectures as baseline models for comparison. Through rigorous assessment based on accuracy metrics, we identify that the pre-trained ViT model surpasses CNNs and ResNet in this multi-label classification task, highlighting its potential for accurate diagnosis of various lung conditions from chest X-ray images.
Related papers
- Heterogeneous Generative Knowledge Distillation with Masked Image
Modeling [33.95780732124864]
Masked image modeling (MIM) methods achieve great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models.
We develop the first Heterogeneous Generative Knowledge Distillation (H-GKD) based on MIM, which can efficiently transfer knowledge from large Transformer models to small CNN-based models in a generative self-supervised fashion.
Our method is a simple yet effective learning paradigm to learn the visual representation and distribution of data from heterogeneous teacher models.
arXiv Detail & Related papers (2023-09-18T08:30:55Z) - Performance of GAN-based augmentation for deep learning COVID-19 image
classification [57.1795052451257]
The biggest challenge in the application of deep learning to the medical domain is the availability of training data.
Data augmentation is a typical methodology used in machine learning when confronted with a limited data set.
In this work, a StyleGAN2-ADA model of Generative Adversarial Networks is trained on the limited COVID-19 chest X-ray image set.
arXiv Detail & Related papers (2023-04-18T15:39:58Z) - Vision Transformer-based Model for Severity Quantification of Lung
Pneumonia Using Chest X-ray Images [11.12596879975844]
We present a Vision Transformer-based neural network model that relies on a small number of trainable parameters to quantify the severity of COVID-19 and other lung diseases.
Our model can provide peak performance in quantifying severity with high generalizability at a relatively low computational cost.
arXiv Detail & Related papers (2023-03-18T12:38:23Z) - Magnification Invariant Medical Image Analysis: A Comparison of
Convolutional Networks, Vision Transformers, and Token Mixers [2.3859625728972484]
Convolution Neural Networks (CNNs) are widely used in medical image analysis.
Their performance degrades when the magnification of the testing images differs from that of the training images.
This study aims to evaluate the robustness of various deep learning architectures.
arXiv Detail & Related papers (2023-02-22T16:44:41Z) - Data-Efficient Vision Transformers for Multi-Label Disease
Classification on Chest Radiographs [55.78588835407174]
Vision Transformers (ViTs) have not been applied to this task despite their high classification performance on generic images.
ViTs do not rely on convolutions but on patch-based self-attention, and in contrast to CNNs, no prior knowledge of local connectivity is present.
Our results show that while the performance of ViTs and CNNs is on par with a small benefit for ViTs, DeiTs outperform the former if a reasonably large data set is available for training.
arXiv Detail & Related papers (2022-08-17T09:07:45Z) - Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z) - UNetFormer: A Unified Vision Transformer Model and Pre-Training
Framework for 3D Medical Image Segmentation [14.873473285148853]
We introduce a unified framework consisting of two architectures, dubbed UNetFormer, with a 3D Swin Transformer-based encoder and Convolutional Neural Network (CNN) and transformer-based decoders.
In the proposed model, the encoder is linked to the decoder via skip connections at five different resolutions with deep supervision.
We present a methodology for self-supervised pre-training of the encoder backbone via learning to predict randomly masked tokens.
arXiv Detail & Related papers (2022-04-01T17:38:39Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - Many-to-One Distribution Learning and K-Nearest Neighbor Smoothing for
Thoracic Disease Identification [83.6017225363714]
Deep learning has become the most powerful computer-aided diagnosis technology for improving disease identification performance.
For chest X-ray imaging, annotating large-scale data requires professional domain knowledge and is time-consuming.
In this paper, we propose many-to-one distribution learning (MODL) and K-nearest neighbor smoothing (KNNS) methods to improve a single model's disease identification performance.
arXiv Detail & Related papers (2021-02-26T02:29:30Z) - Intra-model Variability in COVID-19 Classification Using Chest X-ray
Images [0.0]
We quantify baseline performance metrics and variability for COVID-19 detection in chest X-ray images for 12 common deep learning architectures.
Best performing models achieve a false negative rate of 3 out of 20 for detecting COVID-19 in a hold-out set.
arXiv Detail & Related papers (2020-04-30T21:20:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.