Related papers: SERPENT-VLM : Self-Refining Radiology Report Generation Using Vision Language Models

URL: http://arxiv.org/abs/2404.17912v2
Date: Thu, 18 Jul 2024 16:03:18 GMT
Title: SERPENT-VLM : Self-Refining Radiology Report Generation Using Vision Language Models
Authors: Manav Nitin Kapadnis, Sohan Patnaik, Abhilash Nandy, Sourjyadip Ray, Pawan Goyal, Debdoot Sheet
Abstract summary: Radiology Report Generation (R2Gen) demonstrates how Multi-modal Large Language Models (MLLMs) can automate the creation of accurate and coherent radiological reports. Existing methods often hallucinate details in text-based reports that don't accurately reflect the image content. We introduce a novel strategy, which improves the R2Gen task by integrating a self-refining mechanism into the MLLM framework.
Score: 9.390882250428305
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Radiology Report Generation (R2Gen) demonstrates how Multi-modal Large Language Models (MLLMs) can automate the creation of accurate and coherent radiological reports. Existing methods often hallucinate details in text-based reports that don't accurately reflect the image content. To mitigate this, we introduce a novel strategy, SERPENT-VLM (SElf Refining Radiology RePort GENeraTion using Vision Language Models), which improves the R2Gen task by integrating a self-refining mechanism into the MLLM framework. We employ a unique self-supervised loss that leverages similarity between pooled image representations and the contextual representations of the generated radiological text, alongside the standard Causal Language Modeling objective, to refine image-text representations. This allows the model to scrutinize and align the generated text through dynamic interaction between a given image and the generated text, therefore reducing hallucination and continuously enhancing nuanced report generation. SERPENT-VLM outperforms existing baselines such as LLaVA-Med, BiomedGPT, etc., achieving SoTA performance on the IU X-ray and Radiology Objects in COntext (ROCO) datasets, and also proves to be robust against noisy images. A qualitative case study emphasizes the significant advancements towards more sophisticated MLLM frameworks for R2Gen, opening paths for further research into self-supervised refinement in the medical imaging domain.
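
The abstract describes a training objective that adds an image-text similarity term to the standard Causal Language Modeling loss. The following is a minimal sketch of that idea, not the authors' implementation. The mean pooling, the use of cosine similarity, the `1 - similarity` form of the alignment term, the `weight` parameter, and all function names are assumptions made for illustration. It also assumes the image and text representations already share the same dimensionality.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def causal_lm_loss(token_probs):
    """Mean negative log-likelihood the model assigns to the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def combined_loss(image_patches, text_states, token_probs, weight=0.5):
    """Causal LM loss plus a hypothetical alignment penalty.

    image_patches: per-patch image representations (list of vectors)
    text_states:   contextual representations of the generated report tokens
    token_probs:   probability assigned to each reference token
    weight:        assumed scalar balancing the two terms
    """
    pooled_image = mean_pool(image_patches)
    pooled_text = mean_pool(text_states)
    # The penalty is 0 when the pooled vectors point the same way
    # and grows as they diverge.
    alignment_penalty = 1.0 - cosine_similarity(pooled_image, pooled_text)
    return causal_lm_loss(token_probs) + weight * alignment_penalty

# Toy usage with two-dimensional representations.
loss = combined_loss(
    image_patches=[[1.0, 0.0], [0.8, 0.2]],
    text_states=[[0.9, 0.1], [0.7, 0.3]],
    token_probs=[0.6, 0.9, 0.75],
)
```

In this sketch, a report whose pooled text representation drifts away from the pooled image representation incurs a larger penalty, which is one plausible way to discourage text that does not reflect the image content.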

Related papers

VICCA: Visual Interpretation and Comprehension of Chest X-ray Anomalies in Generated Report Without Human Feedback [1.5839621757142595]
We propose a novel framework designed to enhance the semantic alignment and localization accuracy of AI-generated medical reports. By comparing features between the original and generated images, we introduce a dual-scoring system. This approach significantly outperforms existing methods, achieving state-of-the-art results in pathology localization and text-to-image alignment.
arXiv Detail & Related papers (2025-01-29T16:02:16Z)
Activating Associative Disease-Aware Vision Token Memory for LLM-Based X-ray Report Generation [54.631356899598956]
We propose a novel associative memory-enhanced X-ray report generation model that effectively mimics the process of professional doctors writing medical reports. We employ a visual Hopfield network to establish memory associations for disease-related tokens, and a report Hopfield network to retrieve report memory information.
arXiv Detail & Related papers (2025-01-07T01:19:48Z)
LaB-RAG: Label Boosted Retrieval Augmented Generation for Radiology Report Generation [1.1029725477806065]
We propose Label Boosted Retrieval Augmented Generation (LaB-RAG) to generate radiology reports. We show that LaB-RAG achieves better results across natural language and radiology language metrics compared with other retrieval-based RRG methods. We critique the use of a popular RRG metric, arguing it is possible to artificially inflate its results without true data-leakage.
arXiv Detail & Related papers (2024-11-25T16:10:05Z)
Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models. We use GPT4V to bridge the gap between the reference image and the text input for the T2I model. We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
R2GenCSR: Retrieving Context Samples for Large Language Model based X-ray Medical Report Generation [7.4871243017824165]
This paper proposes a novel context-guided efficient X-ray medical report generation framework. Specifically, we introduce the Mamba as the vision backbone with linear complexity, and the performance obtained is comparable to that of the strong Transformer model.
arXiv Detail & Related papers (2024-08-19T07:15:11Z)
MAIRA-2: Grounded Radiology Report Generation [39.7576903743788]
Radiology reporting is a complex task that requires detailed image understanding, integration of multiple inputs, and precise language generation. Here, we extend report generation to include the localisation of individual findings on the image - a task we call grounded report generation. We introduce MAIRA-2, a large multimodal model combining a radiology-specific image encoder with a LLM, and trained for the new task of grounded report generation on chest X-rays.
arXiv Detail & Related papers (2024-06-06T19:12:41Z)
Dynamic Traceback Learning for Medical Report Generation [12.746275623663289]
This study proposes a novel multi-modal dynamic traceback learning framework (DTrace) for medical report generation. We introduce a traceback mechanism to supervise the semantic validity of generated content and a dynamic learning strategy to adapt to various proportions of image and text input. The proposed DTrace framework outperforms state-of-the-art methods for medical report generation.
arXiv Detail & Related papers (2024-01-24T07:13:06Z)
XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model. It can analyze and answer open-ended questions about chest radiographs. We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians. Recent studies have achieved promising results in automatic impression generation using large-scale medical text data. These models often require substantial amounts of medical text data and have poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z)
Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space. We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains. Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
Cross-Modal Causal Intervention for Medical Report Generation [109.83549148448469]
Medical report generation (MRG) is essential for computer-aided diagnosis and medication guidance. Due to the spurious correlations within image-text data induced by visual and linguistic biases, it is challenging to generate accurate reports reliably describing lesion areas. We propose a novel Visual-Linguistic Causal Intervention (VLCI) framework for MRG, which consists of a visual deconfounding module (VDM) and a linguistic deconfounding module (LDM).
arXiv Detail & Related papers (2023-03-16T07:23:55Z)
Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities. We explicitly account for prior images and reports when available during both training and fine-tuning. Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
Cross-modal Memory Networks for Radiology Report Generation [30.13916304931662]
Cross-modal memory networks (CMN) are proposed to enhance the encoder-decoder framework for radiology report generation. Our model is able to better align information from radiology images and texts, helping to generate more accurate reports in terms of clinical indicators.
arXiv Detail & Related papers (2022-04-28T02:32:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.