Vision-Language Models for Automated Chest X-ray Interpretation: Leveraging ViT and GPT-2
- URL: http://arxiv.org/abs/2501.12356v1
- Date: Tue, 21 Jan 2025 18:36:18 GMT
- Title: Vision-Language Models for Automated Chest X-ray Interpretation: Leveraging ViT and GPT-2
- Authors: Md. Rakibul Islam, Md. Zahid Hossain, Mustofa Ahmed, Most. Sharmin Sultana Samu,
- Abstract summary: We have evaluated different combinations of multimodal models that integrate Computer Vision and Natural Language Processing to generate radiology reports.
We used Chest X-ray images and reports from the IU-Xray dataset to evaluate the usability of the SWIN Transformer-BART, SWIN Transformer-GPT-2, ViT-B16-BART and ViT-B16-GPT-2 models for report generation.
The SWIN-BART model is the best performer among the four, achieving strong results on nearly all evaluation metrics, including ROUGE, BLEU, and BERTScore.
- Score: 0.1874930567916036
- Abstract: Radiology plays a pivotal role in modern medicine due to its non-invasive diagnostic capabilities. However, the manual generation of unstructured medical reports is time-consuming and prone to errors, creating a significant bottleneck in clinical workflows. Despite advancements in AI-generated radiology reports, challenges remain in achieving detailed and accurate report generation. In this study, we evaluated different combinations of multimodal models that integrate Computer Vision and Natural Language Processing to generate comprehensive radiology reports. We employed a pretrained Vision Transformer (ViT-B16) and a SWIN Transformer as the image encoders, with the BART and GPT-2 models serving as the textual decoders. We used chest X-ray images and reports from the IU-Xray dataset to evaluate the SWIN Transformer-BART, SWIN Transformer-GPT-2, ViT-B16-BART, and ViT-B16-GPT-2 models for report generation, aiming to find the best combination among them. The SWIN-BART model is the best performer among the four, achieving strong results on nearly all evaluation metrics, including ROUGE, BLEU, and BERTScore.
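The ViT-B16-GPT-2 pairing described above follows the standard vision-encoder/text-decoder pattern. Below is a minimal sketch of assembling such a model with the Hugging Face transformers library; the checkpoint names, file path, and generation settings are illustrative assumptions rather than the authors' exact configuration, and the randomly initialized cross-attention weights must be fine-tuned on image-report pairs (e.g., IU-Xray) before the outputs are meaningful.

```python
# Minimal sketch: pairing a pretrained ViT-B16 encoder with a GPT-2
# decoder for image-to-text generation, in the spirit of the
# ViT-B16-GPT-2 model evaluated in the paper. Names are illustrative.
from transformers import (
    VisionEncoderDecoderModel,
    ViTImageProcessor,
    GPT2TokenizerFast,
)
from PIL import Image

# Tie a vision encoder and a causal LM decoder; the decoder's
# cross-attention layers start randomly initialized and need
# fine-tuning on image-report pairs.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # ViT-B16 image encoder
    "gpt2",                               # GPT-2 text decoder
)

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Generate a report for one image (meaningless before fine-tuning).
image = Image.open("chest_xray.png").convert("RGB")  # stand-in path
pixel_values = processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```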
Related papers
- CRRG-CLIP: Automatic Generation of Chest Radiology Reports and Classification of Chest Radiographs [2.1711205684359247]
CRRG-CLIP is an end-to-end model for automated report generation and radiograph classification.
The generation module uses Faster R-CNN to identify anatomical regions in radiographs, a binary classifier to select key regions, and GPT-2 to generate semantically coherent reports.
The classification module uses the unsupervised Contrastive Language-Image Pretraining (CLIP) model, addressing the challenge of costly labelled datasets.
arXiv Detail & Related papers (2024-12-31T03:07:27Z)
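The CLIP-based classification module above suggests a zero-shot setup along these lines; this is a minimal sketch assuming the Hugging Face CLIP implementation, with an illustrative checkpoint and label prompts rather than the CRRG-CLIP authors' exact pipeline.

```python
# Minimal sketch of CLIP-style zero-shot radiograph classification;
# checkpoint and label prompts are illustrative, not CRRG-CLIP's setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

labels = ["normal chest x-ray", "chest x-ray showing pneumonia"]  # illustrative
image = Image.open("radiograph.png").convert("RGB")  # stand-in path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity logits -> probabilities over the label prompts.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```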
- Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for Radiology Report Generation [21.772106685777995]
We introduce a radiology-focused visual language model designed to generate radiology reports from chest X-rays.
Our model combines an image encoder with a fine-tuned LLM based on the Vicuna-7B architecture, enabling it to generate different sections of a radiology report with notable accuracy.
arXiv Detail & Related papers (2024-12-06T11:14:03Z)
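The "image encoder plus fine-tuned LLM" design described above is commonly realized by projecting visual features into the LLM's embedding space. The following is a generic sketch of that adapter pattern (LLaVA-style), not Gla-AI4BioMed's exact architecture; the model names are illustrative.

```python
# Generic sketch of the "image encoder + causal LLM" pattern: visual
# features are projected into the LLM's embedding space and prepended
# to the text embeddings. Illustrative model names, not the paper's.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

class VisionAdapterLM(nn.Module):
    def __init__(self, vision_name="openai/clip-vit-large-patch14",
                 lm_name="lmsys/vicuna-7b-v1.5"):
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained(vision_name)
        self.lm = AutoModelForCausalLM.from_pretrained(lm_name)
        # Linear projection from vision hidden size to LM hidden size.
        self.proj = nn.Linear(self.vision.config.hidden_size,
                              self.lm.config.hidden_size)

    def forward(self, pixel_values, input_ids):
        # Patch-level visual features: (batch, num_patches, vision_dim)
        vis = self.vision(pixel_values=pixel_values).last_hidden_state
        vis = self.proj(vis)  # -> (batch, num_patches, lm_dim)
        # Embed the text prompt and prepend the projected image tokens.
        txt = self.lm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([vis, txt], dim=1)
        return self.lm(inputs_embeds=inputs_embeds)
```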
- 3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models [51.855377054763345]
This paper introduces 3D-CT-GPT, a Visual Question Answering (VQA)-based medical visual language model for generating radiology reports from 3D CT scans.
Experiments on both public and private datasets demonstrate that 3D-CT-GPT significantly outperforms existing methods in terms of report accuracy and quality.
arXiv Detail & Related papers (2024-09-28T12:31:07Z)
- R2GenCSR: Retrieving Context Samples for Large Language Model based X-ray Medical Report Generation [7.4871243017824165]
This paper proposes a novel context-guided efficient X-ray medical report generation framework.
Specifically, we introduce Mamba as the vision backbone with linear complexity; the performance obtained is comparable to that of a strong Transformer model.
arXiv Detail & Related papers (2024-08-19T07:15:11Z)
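A simple way to picture context-guided generation is nearest-neighbor retrieval of similar training cases whose reports then condition the generator. The sketch below illustrates that general idea only; it is not R2GenCSR's actual retrieval mechanism, and all names are illustrative.

```python
# Generic sketch of retrieving context samples: given an image feature,
# find the most similar training images by cosine similarity and reuse
# their reports as in-context guidance for the generator.
import numpy as np

def retrieve_context_reports(query_feat, train_feats, train_reports, k=3):
    """Return the reports of the k nearest training images.

    query_feat:    (d,) feature of the query image
    train_feats:   (n, d) features of training images
    train_reports: list of n report strings
    """
    q = query_feat / np.linalg.norm(query_feat)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = t @ q                 # cosine similarities, shape (n,)
    top = np.argsort(-sims)[:k]  # indices of the k most similar
    return [train_reports[i] for i in top]

# The retrieved reports can then be placed in the LLM prompt, e.g.:
# "Similar cases: <report_1> ... Generate a report for the new image."
```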
- Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision.
This study exhaustively evaluated the Gemini and GPT-4 models, along with four other popular large models, across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z)
- A Comparative Study of CNN, ResNet, and Vision Transformers for Multi-Classification of Chest Diseases [0.0]
Vision Transformers (ViT) are powerful tools due to their scalability and ability to process large amounts of data.
We fine-tuned two variants of ViT models, one pre-trained on ImageNet and another trained from scratch, using the NIH Chest X-ray dataset.
Our study evaluates the performance of these models in the multi-label classification of 14 distinct diseases.
arXiv Detail & Related papers (2024-05-31T23:56:42Z)
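Fine-tuning a ViT for 14-way multi-label disease classification, as in the study above, can be set up as follows; a minimal sketch assuming the Hugging Face transformers library, with an illustrative checkpoint and random stand-in data.

```python
# Minimal sketch of fine-tuning a ViT for multi-label classification of
# 14 chest diseases. Setting problem_type="multi_label_classification"
# makes the model use a sigmoid/BCE loss, so each disease is predicted
# independently. Checkpoint name and data are illustrative.
import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=14,
    problem_type="multi_label_classification",
)

pixel_values = torch.randn(2, 3, 224, 224)  # stand-in image batch
labels = torch.zeros(2, 14)                 # multi-hot disease labels
labels[0, [1, 5]] = 1.0                     # e.g., two findings present

outputs = model(pixel_values=pixel_values, labels=labels)
print(outputs.loss)                 # BCE-with-logits loss
probs = outputs.logits.sigmoid()    # per-disease probabilities
```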
- XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
- Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs), in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z)
- Vision Transformer-based Model for Severity Quantification of Lung Pneumonia Using Chest X-ray Images [11.12596879975844]
We present a Vision Transformer-based neural network model that relies on a small number of trainable parameters to quantify the severity of COVID-19 and other lung diseases.
Our model can provide peak performance in quantifying severity with high generalizability at a relatively low computational cost.
arXiv Detail & Related papers (2023-03-18T12:38:23Z)
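One common way to keep the trainable parameter count small, as the severity-quantification model above aims to, is to freeze a pretrained ViT backbone and train only a lightweight head. The sketch below illustrates that generic approach, not the authors' exact architecture; all names are illustrative.

```python
# Generic sketch of the "few trainable parameters" idea: freeze a
# pretrained ViT backbone and train only a small regression head that
# outputs a severity score per image.
import torch
import torch.nn as nn
from transformers import ViTModel

class SeverityRegressor(nn.Module):
    def __init__(self, backbone="google/vit-base-patch16-224-in21k"):
        super().__init__()
        self.vit = ViTModel.from_pretrained(backbone)
        for p in self.vit.parameters():   # freeze the backbone
            p.requires_grad = False
        self.head = nn.Sequential(        # only these weights train
            nn.Linear(self.vit.config.hidden_size, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, pixel_values):
        # Use the [CLS] token as a global image representation.
        cls = self.vit(pixel_values=pixel_values).last_hidden_state[:, 0]
        return self.head(cls).squeeze(-1)  # scalar severity per image

model = SeverityRegressor()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # head only, ~50k
```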
- Medical Image Captioning via Generative Pretrained Transformers [57.308920993032274]
We combine two models, Show-Attend-Tell and GPT-3, to generate comprehensive and descriptive radiology records.
The proposed model is tested on two medical datasets, Open-I and MIMIC-CXR, and on the general-purpose MS-COCO.
arXiv Detail & Related papers (2022-09-28T10:27:10Z)
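Papers in this list, including the headline study, score generated reports against references with ROUGE, BLEU, and BERTScore. Below is a minimal sketch of computing these metrics, assuming the Hugging Face evaluate library is installed; the example strings are illustrative.

```python
# Minimal sketch of scoring generated reports against references with
# the metrics used across these papers (ROUGE, BLEU, BERTScore).
import evaluate

predictions = ["the heart size is normal with no acute findings"]
references = [["heart size is within normal limits no acute disease"]]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# BLEU takes a list of reference lists (multiple references allowed);
# ROUGE and BERTScore accept one reference string per prediction.
print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions,
                    references=[r[0] for r in references]))
print(bertscore.compute(predictions=predictions,
                        references=[r[0] for r in references],
                        lang="en"))
```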
- Data-Efficient Vision Transformers for Multi-Label Disease Classification on Chest Radiographs [55.78588835407174]
Vision Transformers (ViTs) have not been applied to this task despite their high classification performance on generic images.
ViTs do not rely on convolutions but on patch-based self-attention; in contrast to CNNs, no prior knowledge of local connectivity is built in.
Our results show that while ViTs and CNNs perform on par, with a small benefit for ViTs, DeiTs outperform both if a reasonably large dataset is available for training.
arXiv Detail & Related papers (2022-08-17T09:07:45Z)