LT-ViT: A Vision Transformer for multi-label Chest X-ray classification
- URL: http://arxiv.org/abs/2311.07263v1
- Date: Mon, 13 Nov 2023 12:02:46 GMT
- Title: LT-ViT: A Vision Transformer for multi-label Chest X-ray classification
- Authors: Umar Marikkar and Sara Atito and Muhammad Awais and Adam Mahdi
- Abstract summary: Vision Transformers (ViTs) are widely adopted in medical imaging tasks, and some existing efforts have been directed towards vision-language training for Chest X-rays (CXRs).
We have developed LT-ViT, a transformer that utilizes combined attention between image tokens and randomly initialized auxiliary tokens that represent labels.
- Score: 2.3022732986382213
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) are widely adopted in medical imaging tasks, and
some existing efforts have been directed towards vision-language training for
Chest X-rays (CXRs). However, we envision that there still exists a potential
for improvement in vision-only training for CXRs using ViTs, by aggregating
information from multiple scales, which has been proven beneficial for
non-transformer networks. Hence, we have developed LT-ViT, a transformer that
utilizes combined attention between image tokens and randomly initialized
auxiliary tokens that represent labels. Our experiments demonstrate that LT-ViT
(1) surpasses the state-of-the-art performance using pure ViTs on two publicly
available CXR datasets, (2) is generalizable to other pre-training methods and
therefore is agnostic to model initialization, and (3) enables model
interpretability without Grad-CAM and its variants.
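The combined-attention mechanism described in the abstract can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration, not the authors' implementation: the class names (JointAttentionBlock, LabelTokenViT), all layer sizes, and the single-scale encoder are assumptions, and the published LT-ViT additionally aggregates information from multiple scales, which is omitted here. The idea shown is that randomly initialized label tokens attend jointly with the image patch tokens, each label token producing one logit, while its attention over the patches doubles as a per-label saliency map.

```python
import torch
import torch.nn as nn


class JointAttentionBlock(nn.Module):
    """Pre-norm transformer block whose self-attention runs over the
    concatenation of image tokens and label tokens ("combined attention")."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor):
        h = self.norm1(x)
        attn_out, attn_weights = self.attn(h, h, h, need_weights=True)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x, attn_weights  # weights are averaged over heads by default


class LabelTokenViT(nn.Module):
    """Hypothetical ViT variant with one randomly initialized auxiliary token per label."""

    def __init__(self, num_labels: int = 14, dim: int = 384, depth: int = 6,
                 image_size: int = 224, patch_size: int = 16):
        super().__init__()
        self.num_labels = num_labels
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding via a strided convolution, as in the original ViT.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        # Randomly initialized auxiliary tokens, one per label.
        self.label_tokens = nn.Parameter(0.02 * torch.randn(1, num_labels, dim))
        self.blocks = nn.ModuleList([JointAttentionBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, 1)  # maps each final label-token state to one logit

    def forward(self, images: torch.Tensor):
        b = images.shape[0]
        patches = self.patch_embed(images).flatten(2).transpose(1, 2) + self.pos_embed
        tokens = torch.cat([self.label_tokens.expand(b, -1, -1), patches], dim=1)
        for block in self.blocks:
            tokens, attn = block(tokens)
        label_states = tokens[:, :self.num_labels]
        logits = self.head(label_states).squeeze(-1)  # (b, num_labels) multi-label logits
        # Attention from each label token to the image patches in the last block can be
        # reshaped into a per-label heat map, i.e. interpretability without Grad-CAM.
        label_to_patch_attn = attn[:, :self.num_labels, self.num_labels:]
        return logits, label_to_patch_attn


# Example multi-label training step (illustrative shapes, not from the paper):
model = LabelTokenViT()
images = torch.randn(2, 3, 224, 224)
targets = torch.randint(0, 2, (2, 14)).float()
logits, attn_maps = model(images)
loss = nn.BCEWithLogitsLoss()(logits, targets)
```

In this sketch, multi-label supervision comes from a binary cross-entropy loss over the 14 per-label logits, matching the multi-label CXR setting.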
Related papers
- A Comparative Study of CNN, ResNet, and Vision Transformers for Multi-Classification of Chest Diseases [0.0]
Vision Transformers (ViT) are powerful tools due to their scalability and ability to process large amounts of data.
We fine-tuned two variants of ViT models, one pre-trained on ImageNet and another trained from scratch, using the NIH Chest X-ray dataset.
Our study evaluates the performance of these models in the multi-label classification of 14 distinct diseases.
arXiv Detail & Related papers (2024-05-31T23:56:42Z)
- Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT)
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
arXiv Detail & Related papers (2024-01-05T18:59:52Z)
- ViTs for SITS: Vision Transformers for Satellite Image Time Series [52.012084080257544]
We introduce a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT)
TSViT splits a SITS record into non-overlapping patches in space and time which are tokenized and subsequently processed by a factorized temporo-spatial encoder.
arXiv Detail & Related papers (2023-01-12T11:33:07Z)
- Training Vision-Language Transformers from Captions [80.00302205584335]
We introduce a new model, Vision-Language from Captions (VLC), built on top of Masked Auto-Encoders.
In a head-to-head comparison between ViLT and our model, we find that our approach outperforms ViLT on standard benchmarks.
arXiv Detail & Related papers (2022-05-19T00:19:48Z)
- V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer [58.71845618090022]
We build a holistic attention model, namely V2X-ViT, to fuse information across on-road agents.
V2X-ViT consists of alternating layers of heterogeneous multi-agent self-attention and multi-scale window self-attention.
To validate our approach, we create a large-scale V2X perception dataset.
arXiv Detail & Related papers (2022-03-20T20:18:25Z)
- Coarse-to-Fine Vision Transformer [83.45020063642235]
We propose a coarse-to-fine vision transformer (CF-ViT) to relieve computational burden while retaining performance.
Our proposed CF-ViT is motivated by two important observations in modern ViT models.
Our CF-ViT reduces the FLOPs of LV-ViT by 53% and achieves 2.01x higher throughput.
arXiv Detail & Related papers (2022-03-08T02:57:49Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Vision Xformers: Efficient Attention for Image Classification [0.0]
We modify the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient transformers.
We show that ViX performs better than ViT in image classification while consuming fewer computing resources.
arXiv Detail & Related papers (2021-07-05T19:24:23Z)
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [81.07894629034767]
This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer.
It significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques.
arXiv Detail & Related papers (2021-03-29T06:23:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.