Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology
Report Generation
- URL: http://arxiv.org/abs/2303.15932v5
- Date: Tue, 18 Jul 2023 03:06:31 GMT
- Title: Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology
Report Generation
- Authors: Yaowei Li and Bang Yang and Xuxin Cheng and Zhihong Zhu and Hongxiang
Li and Yuexian Zou
- Abstract summary: We propose a Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments.
We introduce three novel modules: Latent Space Unifier, Cross-modal Representation Aligner and Text-to-Image Refiner.
Experiments and analyses on the IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our UAR over various state-of-the-art methods.
- Score: 48.723504098917324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic radiology report generation has attracted enormous research
interest due to its practical value in reducing the workload of radiologists.
However, simultaneously establishing global correspondences between the image
(e.g., Chest X-ray) and its related report and local alignments between image
patches and keywords remains challenging. To this end, we propose a Unify,
Align and then Refine (UAR) approach to learn multi-level cross-modal
alignments and introduce three novel modules: Latent Space Unifier (LSU),
Cross-modal Representation Aligner (CRA) and Text-to-Image Refiner (TIR).
Specifically, LSU unifies multimodal data into discrete tokens, making it
flexible to learn common knowledge among modalities with a shared network. The
modality-agnostic CRA first learns discriminative features via a set of
orthonormal bases and a dual-gate mechanism, and then globally aligns visual
and textual representations under a triplet contrastive loss. TIR boosts
token-level local alignment via calibrating text-to-image attention with a
learnable mask. Additionally, we design a two-stage training procedure to make
UAR gradually grasp cross-modal alignments at different levels, which imitates
radiologists' workflow: writing sentence by sentence first and then checking
word by word. Extensive experiments and analyses on the IU-Xray and MIMIC-CXR
benchmark datasets demonstrate the superiority of our UAR over various
state-of-the-art methods.
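To make the abstract's two alignment mechanisms more concrete, the sketch below pairs a triplet contrastive loss for global image-report alignment (in the spirit of CRA) with a learnable additive mask over text-to-image attention logits (in the spirit of TIR). It is a minimal PyTorch sketch under assumed shapes, margin, and fixed token counts, not the authors' released implementation.

```python
# Minimal sketch (PyTorch assumed) of two alignment ideas from the abstract:
# (1) a triplet contrastive loss for global image-report alignment, and
# (2) a learnable mask that calibrates text-to-image attention.
# Shapes, the margin, and fixed token counts are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn.functional as F


def global_triplet_loss(img_emb, txt_emb, margin=0.2):
    """Triplet-style contrastive loss over a batch of paired embeddings.

    img_emb, txt_emb: (B, D) pooled visual / textual representations.
    The hardest in-batch negative is used in both directions.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t()                              # (B, B) cosine similarities
    pos = sim.diag()                                 # matched image-report pairs
    self_mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(self_mask, float("-inf"))
    hard_txt = neg.max(dim=1).values                 # hardest negative report per image
    hard_img = neg.max(dim=0).values                 # hardest negative image per report
    return (F.relu(margin + hard_txt - pos) + F.relu(margin + hard_img - pos)).mean()


class MaskedTextToImageAttention(torch.nn.Module):
    """Cross-attention from report tokens to image patches, calibrated by a
    learnable additive mask on the attention logits (illustrative)."""

    def __init__(self, dim, num_txt_tokens, num_img_patches):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim)
        self.k = torch.nn.Linear(dim, dim)
        self.v = torch.nn.Linear(dim, dim)
        self.attn_mask = torch.nn.Parameter(torch.zeros(num_txt_tokens, num_img_patches))

    def forward(self, txt_tokens, img_patches):
        # txt_tokens: (B, T, D), img_patches: (B, P, D)
        q, k, v = self.q(txt_tokens), self.k(img_patches), self.v(img_patches)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (B, T, P)
        attn = (scores + self.attn_mask).softmax(dim=-1)       # mask calibrates attention
        return attn @ v                                        # (B, T, D)
```

In such a setup, global_triplet_loss would be added to the report-generation objective, while MaskedTextToImageAttention could stand in for a standard cross-attention layer in the decoder; both are only rough stand-ins for the modules named in the abstract.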
Related papers
- Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity [14.223539927549782]
We propose a novel HybridMED framework to align global-level visual representations with impression and token-level visual representations with findings.
Our framework incorporates a generation decoder that employs two proxy tasks: (1) generating the impression from images via a captioning branch, and (2) generating the findings via a summarization branch.
Experiments on the MIMIC-CXR dataset reveal that our summarization branch effectively distills knowledge to the captioning branch, enhancing model performance without significantly increasing parameter requirements.
arXiv Detail & Related papers (2024-10-01T07:05:36Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable-sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- CAMANet: Class Activation Map Guided Attention Network for Radiology Report Generation [24.072847985361925]
Radiology report generation (RRG) has gained increasing research attention because of its huge potential to mitigate medical resource shortages.
Recent advancements in RRG are driven by improving a model's capabilities in encoding single-modal feature representations.
Few studies explicitly explore the cross-modal alignment between image regions and words.
We propose a Class Activation Map guided Attention Network (CAMANet) which explicitly promotes cross-modal alignment; a rough sketch of this idea appears after the list.
arXiv Detail & Related papers (2022-11-02T18:14:33Z)
- Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning [24.215619918283462]
We present a novel framework for learning medical visual representations directly from paired radiology reports.
Our framework harnesses the naturally exhibited semantic correspondences between medical images and radiology reports at three different levels.
arXiv Detail & Related papers (2022-10-12T09:31:39Z)
- Radiology Report Generation with a Learned Knowledge Base and Multi-modal Alignment [27.111857943935725]
We present an automatic, multi-modal approach for report generation from chest X-rays.
Our approach features two distinct modules: (i) a learned knowledge base and (ii) multi-modal alignment.
With the aid of both modules, our approach clearly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2021-12-30T10:43:56Z)
- Improving Joint Learning of Chest X-Ray and Radiology Report by Word Region Alignment [9.265044250068554]
This paper proposes a Joint Image Text Representation Learning Network (JoImTeRNet) for pre-training on chest X-ray images and their radiology reports.
The model was pre-trained on both the global image-sentence level and the local image region-word level for visual-textual matching.
arXiv Detail & Related papers (2021-09-04T22:58:35Z)
- Cross-Modality Brain Tumor Segmentation via Bidirectional Global-to-Local Unsupervised Domain Adaptation [61.01704175938995]
In this paper, we propose a novel Bidirectional Global-to-Local (BiGL) adaptation framework under a UDA scheme.
Specifically, a bidirectional image synthesis and segmentation module is proposed to segment the brain tumor.
The proposed method outperforms several state-of-the-art unsupervised domain adaptation methods by a large margin.
arXiv Detail & Related papers (2021-05-17T10:11:45Z)
- Cross-Modal Contrastive Learning for Abnormality Classification and Localization in Chest X-rays with Radiomics using a Feedback Loop [63.81818077092879]
We propose an end-to-end semi-supervised cross-modal contrastive learning framework for medical images.
We first apply an image encoder to classify the chest X-rays and to generate the image features.
The radiomic features are then passed through another dedicated encoder to act as the positive sample for the image features generated from the same chest X-ray.
arXiv Detail & Related papers (2021-04-11T09:16:29Z)
- Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)
- MetricUNet: Synergistic Image- and Voxel-Level Learning for Precise CT Prostate Segmentation via Online Sampling [66.01558025094333]
We propose a two-stage framework, with the first stage to quickly localize the prostate region and the second stage to precisely segment the prostate.
We introduce a novel online metric learning module through voxel-wise sampling in the multi-task network.
Our method learns more representative voxel-level features than conventional methods trained with cross-entropy or Dice loss.
arXiv Detail & Related papers (2020-05-15T10:37:02Z)
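For intuition on the class-activation-map guidance mentioned in the CAMANet entry above, the sketch below derives a CAM from patch features and linear classifier weights and uses it as a soft target for word-to-region attention. The function names, shapes, and the KL-style consistency term are illustrative assumptions rather than that paper's actual design.

```python
# Illustrative sketch of class-activation-map (CAM) guided attention, in the
# spirit of the CAMANet entry above: a CAM computed from patch features and
# linear classifier weights serves as a soft target for word-to-region
# attention. Names and the KL consistency term are assumptions, not the
# paper's actual design.
import torch
import torch.nn.functional as F


def class_activation_map(patch_feats, cls_weight, target_class):
    """patch_feats: (B, P, D) visual patch features;
    cls_weight: (C, D) weights of a linear disease classifier;
    target_class: (B,) class index per image."""
    w = cls_weight[target_class]                   # (B, D)
    cam = torch.einsum("bpd,bd->bp", patch_feats, w)
    return cam.softmax(dim=-1)                     # normalized saliency over patches


def cam_attention_consistency(attn, cam):
    """Encourage word-to-region attention to agree with the CAM.

    attn: (B, T, P) attention of each report token over image patches.
    cam:  (B, P) normalized class activation map.
    """
    avg_attn = attn.mean(dim=1)                    # (B, P) pooled over report tokens
    return F.kl_div(avg_attn.clamp_min(1e-8).log(), cam, reduction="batchmean")
```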