CAMANet: Class Activation Map Guided Attention Network for Radiology
Report Generation
- URL: http://arxiv.org/abs/2211.01412v2
- Date: Sun, 3 Mar 2024 10:41:29 GMT
- Title: CAMANet: Class Activation Map Guided Attention Network for Radiology
Report Generation
- Authors: Jun Wang, Abhir Bhalerao, Terry Yin, Simon See, Yulan He
- Abstract summary: Radiology report generation (RRG) has gained increasing research attention because of its huge potential to mitigate medical resource shortages.
Recent advancements in RRG are driven by improving a model's capabilities in encoding single-modal feature representations.
Few studies explicitly explore the cross-modal alignment between image regions and words.
We propose a Class Activation Map guided Attention Network (CAMANet) which explicitly promotes cross-modal alignment.
- Score: 24.072847985361925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Radiology report generation (RRG) has gained increasing research attention because of its huge potential to mitigate medical resource shortages and aid radiologists in disease decision making. Recent advancements in RRG are largely driven by improving a model's capability to encode single-modal feature representations, while few studies explicitly explore the cross-modal alignment between image regions and words. Radiologists typically focus first on abnormal image regions before composing the corresponding text descriptions, so cross-modal alignment is of great importance for learning an RRG model that is aware of abnormalities in the image. Motivated by this, we propose a Class Activation Map guided Attention Network (CAMANet) which explicitly promotes cross-modal alignment by employing aggregated class activation maps to supervise cross-modal attention learning while simultaneously enriching the discriminative information. CAMANet contains three complementary modules: a Visual Discriminative Map Generation module to estimate the importance/contribution of each visual token; a Visual Discriminative Map Assisted Encoder to learn discriminative representations and enrich the discriminative information; and a Visual Textual Attention Consistency module to ensure attention consistency between the visual and textual tokens and thereby achieve the cross-modal alignment. Experimental results demonstrate that CAMANet outperforms previous SOTA methods on two commonly used RRG benchmarks.
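A minimal sketch may help make the attention-supervision idea concrete. The code below is not the authors' implementation; it assumes PyTorch and hypothetical names and shapes (patch_feats, class_weights, cross_attn) and only illustrates one plausible way an aggregated class activation map over visual tokens could supervise the decoder's cross-modal attention, which is the role the Visual Textual Attention Consistency module plays in the abstract above.

```python
# Hedged sketch (not the CAMANet release): supervise cross-modal attention
# with a CAM-derived distribution over visual tokens.
import torch
import torch.nn.functional as F


def aggregate_cam(patch_feats: torch.Tensor, class_weights: torch.Tensor) -> torch.Tensor:
    """Score each visual token with a CAM-style weighted sum.

    patch_feats:   (B, N, D) visual token features from the image encoder.
    class_weights: (D,) classifier weights for the predicted abnormality class.
    Returns a (B, N) distribution over visual tokens.
    """
    scores = patch_feats @ class_weights      # (B, N) raw importance per token
    return F.softmax(scores, dim=-1)          # normalise to a distribution


def attention_consistency_loss(cross_attn: torch.Tensor, cam_dist: torch.Tensor) -> torch.Tensor:
    """KL divergence between aggregated text-to-image attention and the CAM map.

    cross_attn: (B, T, N) attention of T text tokens over N visual tokens.
    cam_dist:   (B, N) CAM-derived target distribution over visual tokens.
    """
    # Average the attention over text tokens, then renormalise over visual tokens.
    attn_dist = F.normalize(cross_attn.mean(dim=1), p=1, dim=-1)  # (B, N)
    return F.kl_div(attn_dist.clamp_min(1e-8).log(), cam_dist, reduction="batchmean")


if __name__ == "__main__":
    B, T, N, D = 2, 12, 49, 512
    loss = attention_consistency_loss(
        torch.rand(B, T, N),
        aggregate_cam(torch.randn(B, N, D), torch.randn(D)),
    )
    print(loss.item())
```

In practice such a consistency term would be added alongside the usual report-generation (cross-entropy) objective, so the cross-modal attention is guided without changing the decoder architecture.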
Related papers
- See Detail Say Clear: Towards Brain CT Report Generation via Pathological Clue-driven Representation Learning [12.40415847810958]
We introduce a Pathological Clue-driven Representation Learning (PCRL) model to build cross-modal representations based on pathological clues.
Specifically, we construct pathological clues from perspectives of segmented regions, pathological entities, and report themes.
To adapt the representations for the text generation task, we bridge the gap between representation learning and report generation by using a unified large language model (LLM) with task-tailored instructions.
arXiv Detail & Related papers (2024-09-29T12:08:20Z)
- ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z)
- Attention-Map Augmentation for Hypercomplex Breast Cancer Classification [6.098816895102301]
We propose a framework, parameterized hypercomplex attention maps (PHAM), to overcome problems with breast cancer classification.
The framework offers two main advantages. First, attention maps provide critical information regarding the ROI and allow the neural model to concentrate on it.
We surpass attention-based state-of-the-art networks and the real-valued counterpart of our approach.
arXiv Detail & Related papers (2023-10-11T16:28:24Z)
- Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation [48.723504098917324]
We propose an Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments.
We introduce three novel modules: Latent Space Unifier, Cross-modal Representation Aligner and Text-to-Image Refiner.
Experiments and analyses on IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our UAR against varied state-of-the-art methods.
arXiv Detail & Related papers (2023-03-28T12:42:12Z)
- Cross-Modal Causal Intervention for Medical Report Generation [109.83549148448469]
Medical report generation (MRG) is essential for computer-aided diagnosis and medication guidance.
Due to the spurious correlations within image-text data induced by visual and linguistic biases, it is challenging to generate accurate reports reliably describing lesion areas.
We propose a novel Visual-Linguistic Causal Intervention (VLCI) framework for MRG, which consists of a visual deconfounding module (VDM) and a linguistic deconfounding module (LDM).
arXiv Detail & Related papers (2023-03-16T07:23:55Z)
- Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning [24.215619918283462]
We present a novel framework for learning medical visual representations directly from paired radiology reports.
Our framework harnesses the naturally exhibited semantic correspondences between medical image and radiology reports at three different levels.
arXiv Detail & Related papers (2022-10-12T09:31:39Z)
- Cross-modal Memory Networks for Radiology Report Generation [30.13916304931662]
Cross-modal memory networks (CMN) are proposed to enhance the encoder-decoder framework for radiology report generation.
Our model is able to better align information from radiology images and texts so as to help generate more accurate reports in terms of clinical indicators.
arXiv Detail & Related papers (2022-04-28T02:32:53Z)
- Cross-Modal Contrastive Learning for Abnormality Classification and Localization in Chest X-rays with Radiomics using a Feedback Loop [63.81818077092879]
We propose an end-to-end semi-supervised cross-modal contrastive learning framework for medical images.
We first apply an image encoder to classify the chest X-rays and to generate the image features.
The radiomic features are then passed through another dedicated encoder to act as the positive sample for the image features generated from the same chest X-ray.
arXiv Detail & Related papers (2021-04-11T09:16:29Z)
- Attention Model Enhanced Network for Classification of Breast Cancer Image [54.83246945407568]
AMEN is formulated in a multi-branch fashion with a pixel-wise attention model and a classification submodule.
To focus more on subtle detail information, the sample image is enhanced by the pixel-wise attention map generated from the former branch.
Experiments conducted on three benchmark datasets demonstrate the superiority of the proposed method under various scenarios.
arXiv Detail & Related papers (2020-10-07T08:44:21Z)
- Auxiliary Signal-Guided Knowledge Encoder-Decoder for Medical Report Generation [107.3538598876467]
We propose an Auxiliary Signal-Guided Knowledge Encoder-Decoder (ASGK) to mimic radiologists' working patterns.
ASGK integrates internal visual feature fusion and external medical linguistic information to guide medical knowledge transfer and learning.
arXiv Detail & Related papers (2020-06-06T01:00:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.