C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network
- URL: http://arxiv.org/abs/2310.05355v1
- Date: Mon, 9 Oct 2023 02:31:36 GMT
- Title: C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network
- Authors: Ruizhi Wang, Xiangtao Wang, Jie Zhou, Thomas Lukasiewicz, Zhenghua Xu
- Abstract summary: We propose C^2M-DoT, a cross-modal consistent multi-view medical report generation framework with a domain transfer network.
C^2M-DoT substantially outperforms state-of-the-art baselines in all metrics.
- Score: 67.97926983664676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In clinical scenarios, multiple medical images with different views are
usually generated simultaneously, and these images have high semantic
consistency. However, most existing medical report generation methods only
consider single-view data. The rich multi-view mutual information of medical
images can help generate more accurate reports; however, the dependence of
multi-view models on multi-view data at inference time severely limits
their application in clinical practice. In addition, word-level optimization
based on numerical metrics ignores the semantics of reports and medical images, so the
generated reports often fail to perform well. Therefore, we propose
C^2M-DoT, a cross-modal consistent multi-view medical report generation framework with a domain
transfer network. Specifically, (i) a semantic-based multi-view
contrastive learning medical report generation framework is adopted to utilize
cross-view information to learn the semantic representation of lesions; (ii) a
domain transfer network is further proposed to ensure that the multi-view
report generation model can still achieve good inference performance under
single-view input; (iii) meanwhile, optimization using a cross-modal
consistency loss facilitates the generation of textual reports that are
semantically consistent with medical images. Extensive experimental studies on
two public benchmark datasets demonstrate that C^2M-DoT substantially
outperforms state-of-the-art baselines in all metrics. Ablation studies also
confirmed the validity and necessity of each component in C^2M-DoT.
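The abstract gives no implementation details, so the following is only a minimal, hypothetical PyTorch sketch (not the authors' code) of how the three named components could fit together: view_a and view_b stand for embeddings of two views of the same study, report_emb for the embedding of the generated report, and the small DomainTransfer MLP stands in for the proposed domain transfer network that maps a single-view feature into the multi-view feature space. Loss forms and equal weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainTransfer(nn.Module):
    """Hypothetical stand-in for the domain transfer network: maps a
    single-view feature into the space learned from multi-view inputs."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, single_view_feat: torch.Tensor) -> torch.Tensor:
        return self.net(single_view_feat)

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style contrastive loss: matching (anchor_i, positive_i)
    pairs are pulled together, all other pairings in the batch are pushed apart."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy batch of embeddings (in practice these come from image and report encoders).
batch, dim = 8, 512
view_a = torch.randn(batch, dim)       # e.g. frontal-view image features
view_b = torch.randn(batch, dim)       # e.g. lateral-view image features
report_emb = torch.randn(batch, dim)   # features of the generated report

transfer = DomainTransfer(dim)

# (i) multi-view contrastive term: views of the same study should agree.
loss_multi_view = info_nce(view_a, view_b)
# (ii) domain transfer term: a single view, after transfer, should match the fused multi-view feature.
fused = (view_a + view_b) / 2
loss_transfer = F.mse_loss(transfer(view_a), fused.detach())
# (iii) cross-modal consistency term: the report should be semantically consistent with the images.
loss_cross_modal = info_nce(report_emb, fused)

total_loss = loss_multi_view + loss_transfer + loss_cross_modal
print(float(total_loss))
```

In the paper these terms would be combined with the report-generation objective itself; the abstract does not specify the weighting or exact loss forms, so the snippet uses equal weights purely for illustration.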
Related papers
- MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation [40.9095393430871]
We introduce MedViLaM, a unified vision-language model towards a generalist model for medical data.
MedViLaM can flexibly encode and interpret various forms of medical data, including clinical language and imaging.
We present instances of zero-shot generalization to new medical concepts and tasks, effective transfer learning across different tasks, and the emergence of zero-shot medical reasoning.
arXiv Detail & Related papers (2024-09-29T12:23:10Z)
- MOSMOS: Multi-organ segmentation facilitated by medical report supervision [10.396987980136602]
We propose a novel pre-training & fine-tuning framework for Multi-Organ Supervision (MOS).
Specifically, we first introduce global contrastive learning to align medical image-report pairs in the pre-training stage.
To remedy the discrepancy, we further leverage multi-label recognition to implicitly learn the semantic correspondence between image pixels and organ tags.
arXiv Detail & Related papers (2024-09-04T03:46:17Z)
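The MOSMOS summary names two concrete ingredients: global contrastive alignment of image-report pairs during pre-training, and multi-label recognition over organ tags. Below is a minimal, hypothetical PyTorch sketch of how those two losses could be combined; the encoders are replaced by placeholder embeddings and the number of organ tags is an assumption, not a detail from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_organs = 14          # assumed number of organ tags, for illustration only
batch, dim = 8, 256

# Placeholder embeddings standing in for image- and report-encoder outputs.
image_emb = torch.randn(batch, dim, requires_grad=True)
report_emb = torch.randn(batch, dim, requires_grad=True)
organ_labels = torch.randint(0, 2, (batch, num_organs)).float()  # multi-hot organ tags

# Global contrastive alignment: paired image/report embeddings should be closest in the batch.
img = F.normalize(image_emb, dim=-1)
rep = F.normalize(report_emb, dim=-1)
logits = img @ rep.t() / 0.07
targets = torch.arange(batch)
contrastive_loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Multi-label organ recognition: a linear head over image features predicts which organ tags appear.
organ_head = nn.Linear(dim, num_organs)
tag_loss = F.binary_cross_entropy_with_logits(organ_head(image_emb), organ_labels)

pretrain_loss = contrastive_loss + tag_loss
print(float(pretrain_loss))
```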
- Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training [99.2891802841936]
We introduce the Med-ST framework for fine-grained spatial and temporal modeling.
For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views.
For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR).
arXiv Detail & Related papers (2024-05-30T03:15:09Z)
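Med-ST's Mixture of View Expert (MoVE) architecture is only named in the summary; one plausible minimal reading is a gated mixture of view-specific experts over frontal and lateral features, sketched below in hypothetical PyTorch. The layer sizes and gating scheme are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class MixtureOfViewExperts(nn.Module):
    """Toy mixture-of-view-experts: one expert per view, combined by a learned gate."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.frontal_expert = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.lateral_expert = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.gate = nn.Linear(2 * dim, 2)  # produces a weight for each expert

    def forward(self, frontal: torch.Tensor, lateral: torch.Tensor) -> torch.Tensor:
        f = self.frontal_expert(frontal)
        l = self.lateral_expert(lateral)
        weights = torch.softmax(self.gate(torch.cat([frontal, lateral], dim=-1)), dim=-1)
        # Weighted sum of the two view-specific expert outputs.
        return weights[:, 0:1] * f + weights[:, 1:2] * l

move = MixtureOfViewExperts()
fused = move(torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 256])
```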
- HyperFusion: A Hypernetwork Approach to Multimodal Integration of Tabular and Medical Imaging Data for Predictive Modeling [4.44283662576491]
We present a novel framework based on hypernetworks to fuse clinical imaging and tabular data by conditioning the image processing on the EHR's values and measurements.
We show that our framework outperforms both single-modality models and state-of-the-art MRI-tabular data fusion methods.
arXiv Detail & Related papers (2024-03-20T05:50:04Z)
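HyperFusion conditions the imaging branch on tabular EHR values via a hypernetwork. The sketch below shows the general pattern in hypothetical PyTorch: a small hypernetwork generates the weights of a linear layer that is then applied to the image features. The dimensions and the choice of a single generated layer are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    """A linear layer whose weights and bias are generated from tabular EHR features."""
    def __init__(self, tab_dim: int, in_dim: int, out_dim: int):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # Hypernetwork: maps tabular features to the parameters of the image-branch layer.
        self.weight_gen = nn.Linear(tab_dim, in_dim * out_dim)
        self.bias_gen = nn.Linear(tab_dim, out_dim)

    def forward(self, image_feat: torch.Tensor, tabular: torch.Tensor) -> torch.Tensor:
        # One weight matrix per sample, conditioned on that sample's EHR values.
        w = self.weight_gen(tabular).view(-1, self.out_dim, self.in_dim)
        b = self.bias_gen(tabular)
        return torch.bmm(w, image_feat.unsqueeze(-1)).squeeze(-1) + b

layer = HyperLinear(tab_dim=16, in_dim=128, out_dim=64)
out = layer(torch.randn(4, 128), torch.randn(4, 16))
print(out.shape)  # torch.Size([4, 64])
```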
- Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
The Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
- MvCo-DoT: Multi-View Contrastive Domain Transfer Network for Medical Report Generation [42.804058630251305]
We propose the first multi-view medical report generation model, called MvCo-DoT.
MvCo-DoT first proposes a multi-view contrastive learning (MvCo) strategy to help the deep reinforcement learning-based model utilize the consistency of multi-view inputs.
Extensive experiments on the IU X-Ray public dataset show that MvCo-DoT outperforms the SOTA medical report generation baselines in all metrics.
arXiv Detail & Related papers (2023-04-15T03:42:26Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
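BioViL-T is described as a CNN-Transformer hybrid multi-image encoder that can incorporate a prior image alongside the current one. A loose, hypothetical sketch of that pattern (not the released model) follows: a shared CNN backbone turns each image into patch tokens, and a transformer encoder attends jointly over the tokens of both time points.

```python
import torch
import torch.nn as nn

class HybridMultiImageEncoder(nn.Module):
    """Toy CNN-Transformer hybrid: a shared CNN tokenizes each image, and a
    transformer encoder jointly attends over current and prior images."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.cnn = nn.Conv2d(1, dim, kernel_size=16, stride=16)  # 1-channel X-ray -> patch tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def tokenize(self, image: torch.Tensor) -> torch.Tensor:
        feat = self.cnn(image)                  # (B, dim, H/16, W/16)
        return feat.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

    def forward(self, current: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        tokens = torch.cat([self.tokenize(current), self.tokenize(prior)], dim=1)
        return self.transformer(tokens)         # joint tokens across both time points

enc = HybridMultiImageEncoder()
out = enc(torch.randn(2, 1, 224, 224), torch.randn(2, 1, 224, 224))
print(out.shape)  # torch.Size([2, 392, 128])
```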
- AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation [50.21065317817769]
We propose an AlignTransformer framework, which includes the Align Hierarchical Attention (AHA) and the Multi-Grained Transformer (MGT) modules.
Experiments on the public IU-Xray and MIMIC-CXR datasets show that the AlignTransformer can achieve results competitive with state-of-the-art methods on the two datasets.
arXiv Detail & Related papers (2022-03-18T13:43:53Z)
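The AlignTransformer summary names an Align Hierarchical Attention module that aligns visual regions with disease tags, but gives no design details. The snippet below only illustrates the generic pattern of attending from tag embeddings to region features in hypothetical PyTorch; the tag count, region count, and single-level attention are assumptions, not the AHA module itself.

```python
import torch
import torch.nn as nn

num_tags, num_regions, dim = 20, 49, 256  # assumed sizes, for illustration only

tag_embeddings = nn.Parameter(torch.randn(num_tags, dim))  # learnable disease-tag queries
region_feats = torch.randn(2, num_regions, dim)            # visual region features (e.g. a CNN grid)

# Cross-attention: each disease tag attends to the image regions most relevant to it.
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
queries = tag_embeddings.unsqueeze(0).expand(2, -1, -1)
tag_aware_visual, weights = attn(queries, region_feats, region_feats)

print(tag_aware_visual.shape)  # torch.Size([2, 20, 256])
print(weights.shape)           # torch.Size([2, 20, 49])
```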
- Cross-Modal Information Maximization for Medical Imaging: CMIM [62.28852442561818]
In hospitals, data are siloed in specific information systems that make the same information available under different modalities.
This offers unique opportunities to obtain and use at train-time those multiple views of the same information that might not always be available at test-time.
We propose an innovative framework that makes the most of available data by learning good representations of a multi-modal input that are resilient to modality dropping at test-time.
arXiv Detail & Related papers (2020-10-20T20:05:35Z)
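CMIM's stated goal is representations that remain useful when some modalities are missing at test time. A common generic way to train for that setting is to drop modalities at random during training; the sketch below shows that pattern in hypothetical PyTorch. It illustrates the problem setting only and is not the CMIM objective itself, which is based on cross-modal information maximization.

```python
import random
import torch
import torch.nn as nn

class ModalityDropFusion(nn.Module):
    """Generic fusion that tolerates missing modalities: each available modality is
    encoded, and the mean of the available embeddings is the joint representation."""
    def __init__(self, dims: dict[str, int], out_dim: int = 128):
        super().__init__()
        self.encoders = nn.ModuleDict({name: nn.Linear(d, out_dim) for name, d in dims.items()})

    def forward(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        embs = [enc(inputs[name]) for name, enc in self.encoders.items() if name in inputs]
        return torch.stack(embs).mean(dim=0)

model = ModalityDropFusion({"xray": 512, "report": 256})
batch = {"xray": torch.randn(4, 512), "report": torch.randn(4, 256)}

# Train-time: randomly drop one modality so the representation cannot rely on either alone.
train_batch = dict(batch)
if random.random() < 0.5 and len(train_batch) > 1:
    train_batch.pop(random.choice(list(train_batch)))
z_train = model(train_batch)

# Test-time: only one modality available; the same model still produces a usable embedding.
z_test = model({"xray": batch["xray"]})
print(z_train.shape, z_test.shape)  # torch.Size([4, 128]) torch.Size([4, 128])
```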