MvCo-DoT: Multi-View Contrastive Domain Transfer Network for Medical Report Generation
- URL: http://arxiv.org/abs/2304.07465v1
- Date: Sat, 15 Apr 2023 03:42:26 GMT
- Title: MvCo-DoT: Multi-View Contrastive Domain Transfer Network for Medical Report Generation
- Authors: Ruizhi Wang, Xiangtao Wang, Zhenghua Xu, Wenting Xu, Junyang Chen,
Thomas Lukasiewicz
- Abstract summary: We propose the first multi-view medical report generation model, called MvCo-DoT.
MvCo-DoT first proposes a multi-view contrastive learning (MvCo) strategy to help the deep-reinforcement-learning-based model utilize the consistency of multi-view inputs.
Extensive experiments on the IU X-Ray public dataset show that MvCo-DoT outperforms the SOTA medical report generation baselines in all metrics.
- Score: 42.804058630251305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In clinical scenarios, multiple medical images with different views are
usually generated at the same time, and they have high semantic consistency.
However, the existing medical report generation methods cannot exploit the rich
multi-view mutual information of medical images. Therefore, in this work, we
propose the first multi-view medical report generation model, called MvCo-DoT.
Specifically, MvCo-DoT first proposes a multi-view contrastive learning (MvCo)
strategy to help the deep-reinforcement-learning-based model utilize the
consistency of multi-view inputs for better model learning. Then, to close the
performance gap between multi-view and single-view inputs, a domain transfer
network is further proposed to ensure that MvCo-DoT achieves almost the same
performance with only single-view inputs as with multi-view inputs. Extensive
experiments on the IU X-Ray public dataset show that MvCo-DoT outperforms the
SOTA medical report generation baselines in all metrics.
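
The abstract names two components: a multi-view contrastive (MvCo) objective over paired views of the same study, and a domain transfer network that lets single-view inputs approach multi-view performance. Below is a minimal sketch of how such loss terms could look; this is not the authors' implementation, and the symmetric InfoNCE form, the mean fusion, the L2 alignment, and all names (mvco_loss, domain_transfer_loss, temperature=0.07) are assumptions.

```python
# Hedged sketch, not the paper's code: plausible loss terms for the two ideas
# named in the abstract, written in PyTorch.
import torch
import torch.nn.functional as F


def mvco_loss(frontal_feats, lateral_feats, temperature=0.07):
    """Multi-view contrastive (MvCo) objective: the two views of the same study
    form a positive pair; views from other studies in the batch act as
    negatives. A symmetric InfoNCE formulation is assumed."""
    fro = F.normalize(frontal_feats, dim=-1)                # (B, D)
    lat = F.normalize(lateral_feats, dim=-1)                # (B, D)
    logits = fro @ lat.t() / temperature                    # (B, B) similarities
    targets = torch.arange(fro.size(0), device=fro.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def domain_transfer_loss(single_view_feats, multi_view_feats):
    """Domain-transfer idea: pull the single-view representation toward the
    fused multi-view one, so single-view inference approaches multi-view
    performance. A simple L2 alignment is assumed; the paper's network may differ."""
    return F.mse_loss(single_view_feats, multi_view_feats.detach())


# Usage sketch: these terms would be added to the report-generation (RL) objective.
B, D = 8, 512
frontal, lateral = torch.randn(B, D), torch.randn(B, D)
fused = 0.5 * (frontal + lateral)          # fusion by averaging is an assumption
loss = mvco_loss(frontal, lateral) + domain_transfer_loss(frontal, fused)
```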
Related papers
- MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation [40.9095393430871]
We introduce MedViLaM, a unified vision-language model designed as a generalist model for medical data.
MedViLaM can flexibly encode and interpret various forms of medical data, including clinical language and imaging.
We present instances of zero-shot generalization to new medical concepts and tasks, effective transfer learning across different tasks, and the emergence of zero-shot medical reasoning.
arXiv Detail & Related papers (2024-09-29T12:23:10Z)
- MOSMOS: Multi-organ segmentation facilitated by medical report supervision [10.396987980136602]
We propose a novel pre-training & fine-tuning framework for Multi-Organ Supervision (MOS).
Specifically, we first introduce global contrastive learning to align medical image-report pairs in the pre-training stage.
To remedy the discrepancy, we further leverage multi-label recognition to implicitly learn the semantic correspondence between image pixels and organ tags.
arXiv Detail & Related papers (2024-09-04T03:46:17Z)
- Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training [99.2891802841936]
We introduce the Med-ST framework for fine-grained spatial and temporal modeling.
For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views.
For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR).
arXiv Detail & Related papers (2024-05-30T03:15:09Z)
- MV-Swin-T: Mammogram Classification with Multi-view Swin Transformer [0.257133335028485]
We propose an innovative multi-view network based on transformers to address challenges in mammographic image classification.
Our approach introduces a novel shifted window-based dynamic attention block, facilitating the effective integration of multi-view information.
arXiv Detail & Related papers (2024-02-26T04:41:04Z)
- C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network [67.97926983664676]
We propose a cross-modal consistent multi-view medical report generation method with a domain transfer network (C^2M-DoT).
C2M-DoT substantially outperforms state-of-the-art baselines in all metrics.
arXiv Detail & Related papers (2023-10-09T02:31:36Z)
- SwinMM: Masked Multi-view with Swin Transformers for 3D Medical Image Segmentation [32.092182889440814]
We present Masked Multi-view with Swin Transformers (SwinMM), a novel multi-view pipeline for medical image analysis.
In the pre-training phase, we deploy a masked multi-view encoder devised to train concurrently on masked multi-view observations.
A new task capitalizes on the consistency between predictions from various perspectives, enabling the extraction of hidden multi-view information.
arXiv Detail & Related papers (2023-07-24T08:06:46Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training [55.56609500764344]
We propose a unified framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework.
We also introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction.
arXiv Detail & Related papers (2023-05-13T13:53:48Z)
- Cross-Modal Information Maximization for Medical Imaging: CMIM [62.28852442561818]
In hospitals, data are siloed to specific information systems that make the same information available under different modalities.
This offers unique opportunities to obtain and use at train-time those multiple views of the same information that might not always be available at test-time.
We propose an innovative framework that makes the most of available data by learning good representations of a multi-modal input that are resilient to modality dropping at test-time.
arXiv Detail & Related papers (2020-10-20T20:05:35Z)
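
The CMIM entry directly above concerns multi-modal representations that stay useful when a modality is dropped at test time. As a small illustration of that general idea (not CMIM's actual objective; the class and every name below are hypothetical), one can train per-modality encoders with an averaging fusion and randomly drop modalities during training:

```python
# Hedged illustration of training for robustness to a missing modality;
# this sketches the general idea behind the CMIM entry, not its method.
import random
import torch
import torch.nn as nn


class ModalityDropoutFusion(nn.Module):
    """Encodes each modality separately, then averages whichever embeddings
    are present, so the same network works when a modality is dropped."""

    def __init__(self, dims, hidden=256):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d, hidden) for d in dims)

    def forward(self, inputs, keep):
        # inputs: list of per-modality tensors (B, d_i); keep: list of bools
        embs = [enc(x) for enc, x, k in zip(self.encoders, inputs, keep) if k]
        return torch.stack(embs).mean(dim=0)


model = ModalityDropoutFusion(dims=[128, 64])
x_img, x_txt = torch.randn(4, 128), torch.randn(4, 64)
# During training, randomly drop the second modality so the fused
# representation stays informative when only one view is available at test time.
keep = [True, random.random() > 0.5]
fused = model([x_img, x_txt], keep)        # (4, 256)
```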