Cross-Modal Causal Intervention for Medical Report Generation
- URL: http://arxiv.org/abs/2303.09117v4
- Date: Wed, 28 Feb 2024 08:57:09 GMT
- Title: Cross-Modal Causal Intervention for Medical Report Generation
- Authors: Weixing Chen, Yang Liu, Ce Wang, Jiarui Zhu, Shen Zhao, Guanbin Li,
Cheng-Lin Liu and Liang Lin
- Abstract summary: Medical report generation (MRG) is essential for computer-aided diagnosis and medication guidance.
Due to the spurious correlations within image-text data induced by visual and linguistic biases, it is challenging to generate accurate reports reliably describing lesion areas.
We propose a novel Visual-Linguistic Causal Intervention (VLCI) framework for MRG, which consists of a visual deconfounding module (VDM) and a linguistic deconfounding module (LDM).
- Score: 109.83549148448469
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical report generation (MRG) is essential for computer-aided diagnosis and
medication guidance, as it can relieve radiologists' heavy workload by
automatically generating the corresponding medical report from a given
radiology image. However, due to the spurious correlations within
image-text data induced by visual and linguistic biases, it is challenging to
generate accurate reports reliably describing lesion areas. Moreover, the
cross-modal confounders are usually unobservable and challenging to eliminate
explicitly. In this paper, we aim to mitigate the cross-modal data
bias for MRG from a new perspective, i.e., cross-modal causal intervention, and
propose a novel Visual-Linguistic Causal Intervention (VLCI) framework for MRG,
which consists of a visual deconfounding module (VDM) and a linguistic
deconfounding module (LDM), to implicitly mitigate the visual-linguistic
confounders by causal front-door intervention. Specifically, due to the absence
of a generalized semantic extractor, the VDM explores and disentangles the
visual confounders from the patch-based local and global features without
expensive fine-grained annotations. Simultaneously, due to the lack of
knowledge encompassing the entire field of medicine, the LDM eliminates the
linguistic confounders caused by salient visual features and high-frequency
context without constructing a terminology database. Extensive experiments on
IU-Xray and MIMIC-CXR datasets show that our VLCI significantly outperforms the
state-of-the-art MRG methods. The code and models are available at
https://github.com/WissingChen/VLCI.
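For readers less familiar with the causal terminology, the front-door intervention mentioned above is the standard front-door adjustment from causal inference: a mediator M on the path from input X to output Y lets the effect of X on Y be estimated even when the confounder between X and Y is unobserved. Roughly, the deconfounded representations produced by VDM and LDM play the role of such mediators. With X the radiology image, Y the generated report, and M the mediator, the adjustment reads (our notation, not taken from the paper):

    P(Y | do(X)) = \sum_{m} P(m | X) \sum_{x'} P(Y | x', m) P(x')

How VLCI parameterizes and approximates the two sums over visual and linguistic features is detailed in the paper and the released code linked above.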
Related papers
- Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation [42.13004422063442]
Acute ischemic stroke (AIS) requires time-critical management, as hours of delay in intervention can lead to irreversible disability.
Since diffusion-weighted imaging (DWI), a magnetic resonance imaging (MRI) sequence, plays a crucial role in detecting AIS, automated prediction of AIS from DWI has been a research topic of clinical importance.
While text radiology reports contain the most relevant clinical information from the image findings, the difficulty of mapping across different modalities has limited the factuality of conventional direct DWI-to-report generation methods.
arXiv Detail & Related papers (2024-11-23T08:18:55Z)
- TRRG: Towards Truthful Radiology Report Generation With Cross-modal Disease Clue Enhanced Large Language Model [22.305034251561835]
We propose a truthful radiology report generation framework, namely TRRG, based on stage-wise training for cross-modal disease clue injection into large language models.
Our proposed framework achieves state-of-the-art performance in radiology report generation on datasets such as IU-Xray and MIMIC-CXR.
arXiv Detail & Related papers (2024-08-22T05:52:27Z)
- SERPENT-VLM : Self-Refining Radiology Report Generation Using Vision Language Models [9.390882250428305]
Radiology Report Generation (R2Gen) demonstrates how Multi-modal Large Language Models (MLLMs) can automate the creation of accurate and coherent radiological reports.
Existing methods often hallucinate details in text-based reports that don't accurately reflect the image content.
We introduce a novel strategy, which improves the R2Gen task by integrating a self-refining mechanism into the MLLM framework.
arXiv Detail & Related papers (2024-04-27T13:46:23Z)
- Dynamic Traceback Learning for Medical Report Generation [12.746275623663289]
This study proposes a novel multi-modal dynamic traceback learning framework (DTrace) for medical report generation.
We introduce a traceback mechanism to supervise the semantic validity of generated content and a dynamic learning strategy to adapt to various proportions of image and text input.
The proposed DTrace framework outperforms state-of-the-art methods for medical report generation.
arXiv Detail & Related papers (2024-01-24T07:13:06Z)
- Medical Report Generation based on Segment-Enhanced Contrastive Representation Learning [39.17345313432545]
We propose MSCL (Medical image Segmentation with Contrastive Learning) to segment organs, abnormalities, bones, etc.
We introduce a supervised contrastive loss that assigns more weight to reports that are semantically similar to the target during training; a minimal sketch of this weighting idea appears after the list below.
Experimental results demonstrate the effectiveness of our proposed model, where we achieve state-of-the-art performance on the IU X-Ray public dataset.
arXiv Detail & Related papers (2023-12-26T03:33:48Z)
- Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest x-ray (CXR) images and associated patient demographic information.
The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
arXiv Detail & Related papers (2023-11-18T14:52:26Z)
- Dynamic Graph Enhanced Contrastive Learning for Chest X-ray Report Generation [92.73584302508907]
We propose a knowledge graph with Dynamic structure and nodes to facilitate medical report generation with Contrastive Learning.
In detail, the fundamental structure of our graph is pre-constructed from general knowledge.
Each image feature is integrated with its very own updated graph before being fed into the decoder module for report generation.
arXiv Detail & Related papers (2023-03-18T03:53:43Z)
- Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering [134.91774666260338]
Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes.
We propose a framework for cross-modal causal relational reasoning to address the task of event-level visual question answering.
arXiv Detail & Related papers (2022-07-26T04:25:54Z)
- AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation [50.21065317817769]
We propose an AlignTransformer framework, which includes the Align Hierarchical Attention (AHA) and the Multi-Grained Transformer (MGT) modules.
Experiments on the public IU-Xray and MIMIC-CXR datasets show that the AlignTransformer can achieve results competitive with state-of-the-art methods on the two datasets.
arXiv Detail & Related papers (2022-03-18T13:43:53Z)
- Auxiliary Signal-Guided Knowledge Encoder-Decoder for Medical Report Generation [107.3538598876467]
We propose an Auxiliary Signal-Guided Knowledge Encoder-Decoder (ASGK) to mimic radiologists' working patterns.
ASGK integrates internal visual feature fusion and external medical linguistic information to guide medical knowledge transfer and learning.
arXiv Detail & Related papers (2020-06-06T01:00:15Z)
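As a concrete illustration of the weighting idea in the MSCL entry above, the sketch below scales each positive pair of a supervised contrastive loss by the semantic similarity of the corresponding ground-truth reports. This is a minimal sketch under our own assumptions, not the MSCL authors' implementation; the function name and the report_sims input are hypothetical.

    import torch
    import torch.nn.functional as F

    def similarity_weighted_contrastive_loss(features, report_sims, temperature=0.07):
        # features:    (N, D) embeddings of the N samples in a batch.
        # report_sims: (N, N) pairwise semantic similarity of the ground-truth reports
        #              in [0, 1] (e.g. cosine similarity of report embeddings);
        #              larger values mark stronger positives.
        features = F.normalize(features, dim=1)
        logits = features @ features.t() / temperature          # (N, N) pairwise logits
        self_mask = torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)
        logits = logits.masked_fill(self_mask, -1e9)            # exclude self-pairs

        # log-probability of picking sample j as the match for sample i (row-wise softmax)
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

        # weight each pair by report similarity, so semantically closer reports pull harder
        weights = report_sims.masked_fill(self_mask, 0.0)
        per_sample = -(weights * log_prob).sum(dim=1) / weights.sum(dim=1).clamp(min=1e-8)
        return per_sample.mean()

A batch of image or report embeddings plus a precomputed report-similarity matrix (from TF-IDF or sentence embeddings, for instance) is enough to try the idea; the loss actually used by MSCL may differ in its details.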
This list is automatically generated from the titles and abstracts of the papers on this site.