Cross-modal Prototype Driven Network for Radiology Report Generation
- URL: http://arxiv.org/abs/2207.04818v1
- Date: Mon, 11 Jul 2022 12:29:33 GMT
- Title: Cross-modal Prototype Driven Network for Radiology Report Generation
- Authors: Jun Wang, Abhir Bhalerao, and Yulan He
- Abstract summary: Radiology report generation (RRG) aims to automatically describe a radiology image with human-like language and could potentially support the work of radiologists.
Previous approaches often adopt an encoder-decoder architecture and focus on single-modal feature learning.
Here we propose a Cross-modal PROtotype driven NETwork (XPRONET) to promote cross-modal pattern learning and exploit it to improve the task of radiology report generation.
- Score: 30.029659845237077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Radiology report generation (RRG) aims to automatically describe a
radiology image with human-like language and could potentially support the work
of radiologists, reducing the burden of manual reporting. Previous approaches
often adopt an encoder-decoder architecture and focus on single-modal feature
learning, while few studies explore cross-modal feature interaction. Here we
propose a Cross-modal PROtotype driven NETwork (XPRONET) to promote cross-modal
pattern learning and exploit it to improve the task of radiology report
generation. This is achieved by three well-designed, fully differentiable and
complementary modules: a shared cross-modal prototype matrix to record the
cross-modal prototypes; a cross-modal prototype network to learn the
cross-modal prototypes and embed the cross-modal information into the visual
and textual features; and an improved multi-label contrastive loss to enable
and enhance multi-label prototype learning. XPRONET obtains substantial
improvements on the IU-Xray and MIMIC-CXR benchmarks, exceeding recent
state-of-the-art approaches by a large margin on IU-Xray and achieving
comparable performance on MIMIC-CXR.
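To make the shared cross-modal prototype idea concrete, the sketch below queries a prototype matrix with feature vectors and folds the weighted prototype responses back into the features. This is a simplified illustration, not the authors' implementation: the function names, the top-k truncation, and the residual fusion are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prototype_respond(features, prototypes, top_k=2):
    """Query a shared cross-modal prototype matrix and return
    prototype-enhanced features (illustrative sketch only).

    features:   (n, d) visual or textual embeddings
    prototypes: (p, d) shared cross-modal prototype matrix
    """
    # similarity between each feature and each prototype
    sim = features @ prototypes.T                      # (n, p)
    # keep only the top-k most similar prototypes per feature
    drop = np.argsort(sim, axis=1)[:, :-top_k]
    masked = sim.copy()
    np.put_along_axis(masked, drop, -np.inf, axis=1)
    weights = softmax(masked, axis=1)                  # (n, p)
    # weighted prototype response, fused back residually so that
    # cross-modal information is embedded into the features
    response = weights @ prototypes                    # (n, d)
    return features + response
```

The same routine can be applied to both the visual and the textual branch, since the prototype matrix is shared across modalities; that sharing is what lets the two feature spaces exchange information.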
Related papers
- X-VILA: Cross-Modality Alignment for Large Language Model [91.96081978952283]
X-VILA is an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities.
We propose a visual alignment mechanism with a visual embedding highway module to address the problem of visual information loss.
X-VILA exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins.
arXiv Detail & Related papers (2024-05-29T17:59:58Z)
- MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
- MAIRA-1: A specialised large multimodal model for radiology report generation [41.69727330319648]
We present a radiology-specific multimodal model for generating radiological reports from chest X-rays (CXRs).
Our work builds on the idea that large language models (LLMs) can be equipped with multimodal capabilities through alignment with pre-trained vision encoders.
Our proposed model (MAIRA-1) leverages a CXR-specific image encoder in conjunction with a fine-tuned large language model based on Vicuna-7B, and text-based data augmentation, to produce reports with state-of-the-art quality.
arXiv Detail & Related papers (2023-11-22T19:45:40Z)
- Cross-Modal Translation and Alignment for Survival Analysis [7.657906359372181]
We present a framework to explore the intrinsic cross-modal correlations and transfer potential complementary information.
Our experiments on five public TCGA datasets demonstrate that our proposed framework outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2023-09-22T13:29:14Z)
- Cross-modal Orthogonal High-rank Augmentation for RGB-Event Transformer-trackers [58.802352477207094]
We explore the great potential of a pre-trained vision Transformer (ViT) to bridge the vast distribution gap between two modalities.
We propose a mask modeling strategy that randomly masks tokens of a specific modality to enforce proactive interaction between tokens from different modalities.
Experiments demonstrate that our plug-and-play training augmentation techniques can significantly boost state-of-the-art one-stream and two-stream trackers in terms of both tracking precision and success rate.
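A minimal sketch of what such modality masking might look like in practice is given below. It is purely illustrative: the function name, mask value, and mask ratio are assumptions, not the paper's actual procedure.

```python
import numpy as np

def mask_modality(tokens, mask_ratio=0.3, mask_value=0.0, seed=0):
    """Randomly mask a fraction of one modality's tokens so the model
    must rely on the other modality's tokens (illustrative sketch).

    tokens: (n, d) array of token embeddings from a single modality
    """
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    k = int(n * mask_ratio)                     # number of tokens to mask
    idx = rng.choice(n, size=k, replace=False)  # which tokens to mask
    masked = tokens.copy()
    masked[idx] = mask_value                    # overwrite with mask value
    return masked, idx
```

During training, the masking target would alternate between the RGB and the event modality, so cross-modal attention has to fill in what the masked branch no longer provides.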
arXiv Detail & Related papers (2023-07-09T08:58:47Z)
- Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training [55.56609500764344]
We propose a unified framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework.
We also introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction.
arXiv Detail & Related papers (2023-05-13T13:53:48Z)
- Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation [48.723504098917324]
We propose an Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments.
We introduce three novel modules: Latent Space Unifier, Cross-modal Representation Aligner and Text-to-Image Refiner.
Experiments and analyses on IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our UAR against varied state-of-the-art methods.
arXiv Detail & Related papers (2023-03-28T12:42:12Z)
- Cross-modal Memory Networks for Radiology Report Generation [30.13916304931662]
Cross-modal memory networks (CMN) are proposed to enhance the encoder-decoder framework for radiology report generation.
Our model is able to better align information from radiology images and texts so as to help generate more accurate reports in terms of clinical indicators.
arXiv Detail & Related papers (2022-04-28T02:32:53Z)
- X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics [99.03895740754402]
X-modaler encapsulates the state-of-the-art cross-modal analytics into several general-purpose stages.
X-modaler is Apache-licensed, and its source code, sample projects and pre-trained models are available online.
arXiv Detail & Related papers (2021-08-18T16:05:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.