Cross-modal Prototype Driven Network for Radiology Report Generation
- URL: http://arxiv.org/abs/2207.04818v1
- Date: Mon, 11 Jul 2022 12:29:33 GMT
- Title: Cross-modal Prototype Driven Network for Radiology Report Generation
- Authors: Jun Wang, Abhir Bhalerao, and Yulan He
- Abstract summary: Radiology report generation (RRG) aims to automatically describe a radiology image in human-like language and could potentially support the work of radiologists.
Previous approaches often adopt an encoder-decoder architecture and focus on single-modal feature learning.
Here we propose a Cross-modal PROtotype driven NETwork (XPRONET) to promote cross-modal pattern learning and exploit it to improve the task of radiology report generation.
- Score: 30.029659845237077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Radiology report generation (RRG) aims to automatically describe a
radiology image in human-like language and could potentially support the work
of radiologists, reducing the burden of manual reporting. Previous approaches
often adopt an encoder-decoder architecture and focus on single-modal feature
learning, while few studies explore cross-modal feature interaction. Here we
propose a Cross-modal PROtotype driven NETwork (XPRONET) to promote cross-modal
pattern learning and exploit it to improve the task of radiology report
generation. This is achieved by three well-designed, fully differentiable and
complementary modules: a shared cross-modal prototype matrix to record the
cross-modal prototypes; a cross-modal prototype network to learn the
cross-modal prototypes and embed the cross-modal information into the visual
and textual features; and an improved multi-label contrastive loss to enable
and enhance multi-label prototype learning. XPRONET obtains substantial
improvements on the IU-Xray and MIMIC-CXR benchmarks, exceeding recent
state-of-the-art approaches by a large margin on IU-Xray and achieving
comparable performance on MIMIC-CXR.
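To make the three modules described above more concrete, the following is a minimal sketch of how a shared cross-modal prototype matrix might be queried by visual or textual features and its responses fused back into those features. All names (CrossModalPrototypeLayer, query_proj, out_proj), shapes, and the single-head attention formulation are illustrative assumptions, not the paper's actual implementation; the improved multi-label contrastive loss is omitted.

```python
# Illustrative sketch only: a shared, learnable prototype matrix queried by
# features from either modality, in the spirit of the abstract's description.
import torch
import torch.nn as nn


class CrossModalPrototypeLayer(nn.Module):
    """Queries a shared prototype matrix with visual or textual features and
    mixes the responded prototypes back into the features (assumed design)."""

    def __init__(self, num_prototypes: int = 512, dim: int = 256):
        super().__init__()
        # Shared cross-modal prototype matrix: one learnable row per prototype.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.query_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, dim) -- image patch features or report token features.
        q = self.query_proj(feats)                                            # (B, L, D)
        attn = torch.softmax(q @ self.prototypes.t() / q.size(-1) ** 0.5, -1)  # (B, L, P)
        responded = attn @ self.prototypes                                    # (B, L, D)
        # Residual fusion of the responded prototype information.
        return feats + self.out_proj(responded)


# Usage: applying the same layer (hence the same prototype matrix) to both
# modalities is what would let image and report features share one prototype
# space, which is the intuition behind cross-modal pattern learning.
proto_layer = CrossModalPrototypeLayer()
visual_feats = torch.randn(2, 49, 256)   # e.g. 7x7 image patch embeddings
text_feats = torch.randn(2, 60, 256)     # report token embeddings
visual_out = proto_layer(visual_feats)
text_out = proto_layer(text_feats)
```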
Related papers
- MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks [50.98856172702256]
We propose the Modality-INformed knowledge Distillation (MIND) framework, a multimodal model compression approach.
MIND transfers knowledge from ensembles of pre-trained deep neural networks of varying sizes into a smaller multimodal student.
We evaluate MIND on binary and multilabel clinical prediction tasks using time series data and chest X-ray images.
arXiv Detail & Related papers (2025-02-03T08:50:00Z) - MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays, ECGs, and Diagnostic Report [4.340464264725625]
We introduce a novel Multi-Modal Contrastive Pre-training Framework that synergistically combines X-rays, electrocardiograms (ECGs) and radiology/cardiology reports.
We utilize LoRA-Peft to significantly reduce trainable parameters in the LLM and incorporate a recent linear attention dropping strategy in the Vision Transformer (ViT) for smoother attention.
To the best of our knowledge, we are the first to propose an integrated model that combines X-ray, ECG, and Radiology/Cardiology Report with this approach.
arXiv Detail & Related papers (2024-10-21T17:42:41Z) - X-VILA: Cross-Modality Alignment for Large Language Model [91.96081978952283]
X-VILA is an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities.
We propose a visual alignment mechanism with a visual embedding highway module to address the problem of visual information loss.
X-VILA exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins.
arXiv Detail & Related papers (2024-05-29T17:59:58Z) - Cross-Modal Translation and Alignment for Survival Analysis [7.657906359372181]
We present a framework to explore the intrinsic cross-modal correlations and transfer potential complementary information.
Our experiments on five public TCGA datasets demonstrate that our proposed framework outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2023-09-22T13:29:14Z) - Cross-modal Orthogonal High-rank Augmentation for RGB-Event
Transformer-trackers [58.802352477207094]
We explore the great potential of a pre-trained vision Transformer (ViT) to bridge the vast distribution gap between two modalities.
We propose a mask modeling strategy that randomly masks a specific modality of some tokens to enforce proactive interaction between tokens from different modalities.
Experiments demonstrate that our plug-and-play training augmentation techniques can significantly boost state-of-the-art one-stream and two-stream trackers in terms of both tracking precision and success rate.
arXiv Detail & Related papers (2023-07-09T08:58:47Z) - Multi-task Paired Masking with Alignment Modeling for Medical
Vision-Language Pre-training [55.56609500764344]
We propose a unified framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework.
We also introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction.
arXiv Detail & Related papers (2023-05-13T13:53:48Z) - Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology
Report Generation [48.723504098917324]
We propose an Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments.
We introduce three novel modules: Latent Space Unifier, Cross-modal Representation Aligner and Text-to-Image Refiner.
Experiments and analyses on the IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our UAR against various state-of-the-art methods.
arXiv Detail & Related papers (2023-03-28T12:42:12Z) - Cross-modal Memory Networks for Radiology Report Generation [30.13916304931662]
Cross-modal memory networks (CMN) are proposed to enhance the encoder-decoder framework for radiology report generation.
Our model is able to better align information from radiology images and texts so as to help generate more accurate reports in terms of clinical indicators.
arXiv Detail & Related papers (2022-04-28T02:32:53Z) - X-modaler: A Versatile and High-performance Codebase for Cross-modal
Analytics [99.03895740754402]
X-modaler encapsulates the state-of-the-art cross-modal analytics into several general-purpose stages.
X-modaler is Apache-licensed, and its source code, sample projects and pre-trained models are available online.
arXiv Detail & Related papers (2021-08-18T16:05:30Z)