Data-Efficient Vision Transformers for Multi-Label Disease
Classification on Chest Radiographs
- URL: http://arxiv.org/abs/2208.08166v1
- Date: Wed, 17 Aug 2022 09:07:45 GMT
- Title: Data-Efficient Vision Transformers for Multi-Label Disease
Classification on Chest Radiographs
- Authors: Finn Behrendt, Debayan Bhattacharya, Julia Krüger, Roland Opfer, Alexander Schlaefer
- Abstract summary: Vision Transformers (ViTs) have not been applied to this task despite their high classification performance on generic images.
ViTs do not rely on convolutions but on patch-based self-attention and in contrast to CNNs, no prior knowledge of local connectivity is present.
Our results show that while the performance between ViTs and CNNs is on par with a small benefit for ViTs, DeiTs outperform the former if a reasonably large data set is available for training.
- Score: 55.78588835407174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Radiographs are a versatile diagnostic tool for the detection and assessment
of pathologies, for treatment planning or for navigation and localization
purposes in clinical interventions. However, their interpretation and
assessment by radiologists can be tedious and error-prone. Thus, a wide variety
of deep learning methods have been proposed to support radiologists in
interpreting radiographs. Mostly, these approaches rely on convolutional neural
networks (CNNs) to extract features from images. Especially for the multi-label
classification of pathologies on chest radiographs (Chest X-Rays, CXR), CNNs
have proven to be well suited. In contrast, Vision Transformers (ViTs) have
not been applied to this task despite their high classification performance on
generic images and interpretable local saliency maps which could add value to
clinical interventions. ViTs do not rely on convolutions but on patch-based
self-attention and, in contrast to CNNs, no prior knowledge of local
connectivity is present. While this leads to increased capacity, ViTs typically
require an excessive amount of training data which represents a hurdle in the
medical domain as high costs are associated with collecting large medical data
sets. In this work, we systematically compare the classification performance of
ViTs and CNNs for different data set sizes and evaluate more data-efficient ViT
variants (DeiT). Our results show that while the performance between ViTs and
CNNs is on par with a small benefit for ViTs, DeiTs outperform the former if a
reasonably large data set is available for training.
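To make the compared setup concrete, the following is a minimal sketch (not the authors' released code) of how a pretrained ViT or DeiT backbone can be adapted for multi-label CXR classification: the single-label head is replaced by one output per pathology and trained with a binary cross-entropy loss over independent sigmoids. The timm model name, the 14-label configuration, and the input resolution are illustrative assumptions.

    import torch
    import torch.nn as nn
    import timm  # assumption: timm supplies the pretrained ViT/DeiT backbones

    NUM_LABELS = 14  # assumption: one output per CXR pathology label

    # Pretrained DeiT backbone; timm replaces the classifier with a fresh
    # linear head of size NUM_LABELS.
    model = timm.create_model("deit_small_patch16_224", pretrained=True,
                              num_classes=NUM_LABELS)

    # Multi-label setup: independent sigmoid per pathology, binary cross-entropy.
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    images = torch.randn(8, 3, 224, 224)                    # dummy CXR batch
    targets = torch.randint(0, 2, (8, NUM_LABELS)).float()  # dummy binary label vectors

    logits = model(images)               # shape (8, NUM_LABELS), raw scores
    loss = criterion(logits, targets)
    loss.backward()
    optimizer.step()

    probs = torch.sigmoid(logits)        # per-pathology probabilities for evaluation (e.g. AUROC)

Swapping the model name for a CNN backbone or a distilled DeiT variant leaves the rest of this pipeline unchanged, which is essentially what a ViT-versus-CNN comparison across different training-set sizes requires.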
Related papers
- CathFlow: Self-Supervised Segmentation of Catheters in Interventional Ultrasound Using Optical Flow and Transformers [66.15847237150909]
We introduce a self-supervised deep learning architecture to segment catheters in longitudinal ultrasound images.
The network architecture builds upon AiAReSeg, a segmentation transformer built with the Attention in Attention mechanism.
We validated our model on a test dataset, consisting of unseen synthetic data and images collected from silicon aorta phantoms.
arXiv Detail & Related papers (2024-03-21T15:13:36Z)
- A Recent Survey of Vision Transformers for Medical Image Segmentation [2.4895533667182703]
Vision Transformers (ViTs) have emerged as a promising technique for addressing the challenges in medical image segmentation.
Their multi-scale attention mechanism enables effective modeling of long-range dependencies between distant structures.
Recently, researchers have proposed various ViT-based approaches that incorporate CNNs in their architectures, known as Hybrid Vision Transformers (HVTs).
arXiv Detail & Related papers (2023-12-01T14:54:44Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- EchoCoTr: Estimation of the Left Ventricular Ejection Fraction from Spatiotemporal Echocardiography [0.0]
We propose a method that addresses the limitations we typically face when training on medical video data such as echocardiographic scans.
The algorithm we propose (EchoCoTr) utilizes the strength of vision transformers and CNNs to tackle the problem of estimating the left ventricular ejection fraction (LVEF) on ultrasound videos.
arXiv Detail & Related papers (2022-09-09T11:01:59Z)
- RadTex: Learning Efficient Radiograph Representations from Text Reports [7.090896766922791]
We build a data-efficient learning framework that utilizes radiology reports to improve medical image classification performance with limited labeled data.
Our model achieves higher classification performance than ImageNet-supervised pretraining when labeled training data is limited.
arXiv Detail & Related papers (2022-08-05T15:06:26Z)
- Preservation of High Frequency Content for Deep Learning-Based Medical Image Classification [74.84221280249876]
An efficient analysis of large amounts of chest radiographs can aid physicians and radiologists.
We propose a novel Discrete Wavelet Transform (DWT)-based method for the efficient identification and encoding of visual information.
arXiv Detail & Related papers (2022-05-08T15:29:54Z)
- An Analysis of the Influence of Transfer Learning When Measuring the Tortuosity of Blood Vessels [0.7646713951724011]
Convolutional Neural Networks (CNNs) have been shown to provide excellent results regarding the segmentation of blood vessels.
Yet, it is still unclear if pre-trained CNNs can provide robust, unbiased, results on downstream tasks when applied to datasets that they were not trained on.
We show that the tortuosity values obtained by a CNN trained from scratch on a dataset may not agree with those obtained by a fine-tuned network that was pre-trained on a dataset having different tortuosity statistics.
arXiv Detail & Related papers (2021-11-19T14:55:52Z)
- Voice-assisted Image Labelling for Endoscopic Ultrasound Classification using Neural Networks [48.732863591145964]
We propose a multi-modal convolutional neural network architecture that labels endoscopic ultrasound (EUS) images from raw verbal comments provided by a clinician during the procedure.
Our results show a prediction accuracy of 76% at image level on a dataset with 5 different labels.
arXiv Detail & Related papers (2021-10-12T21:22:24Z)
- Vision Transformers for femur fracture classification [59.99241204074268]
The Vision Transformer (ViT) was able to correctly predict 83% of the test images.
Good results were obtained on sub-fracture classes using the largest and richest dataset to date.
arXiv Detail & Related papers (2021-08-07T10:12:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.