SMFusion: Semantic-Preserving Fusion of Multimodal Medical Images for Enhanced Clinical Diagnosis
- URL: http://arxiv.org/abs/2505.12251v1
- Date: Sun, 18 May 2025 06:15:00 GMT
- Title: SMFusion: Semantic-Preserving Fusion of Multimodal Medical Images for Enhanced Clinical Diagnosis
- Authors: Haozhe Xiang, Han Zhang, Yu Cheng, Xiongwen Quan, Wanwan Huang
- Abstract summary: We propose a novel semantic-guided medical image fusion approach that incorporates medical prior knowledge into the fusion process. We generate diagnostic reports from the fused images to assess the preservation of medical information. Experimental results on test datasets demonstrate that the proposed method achieves superior performance in both qualitative and quantitative evaluations.
- Score: 11.356721356096564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal medical image fusion plays a crucial role in medical diagnosis by integrating complementary information from different modalities to enhance image readability and clinical applicability. However, existing methods mainly follow computer vision standards for feature extraction and fusion strategy formulation, overlooking the rich semantic information inherent in medical images. To address this limitation, we propose a novel semantic-guided medical image fusion approach that, for the first time, incorporates medical prior knowledge into the fusion process. Specifically, we construct a publicly available multimodal medical image-text dataset, upon which text descriptions generated by BiomedGPT are encoded and semantically aligned with image features in a high-dimensional space via a semantic interaction alignment module. During this process, a cross attention based linear transformation automatically maps the relationship between textual and visual features to facilitate comprehensive learning. The aligned features are then embedded into a text-injection module for further feature-level fusion. Unlike traditional methods, we further generate diagnostic reports from the fused images to assess the preservation of medical information. Additionally, we design a medical semantic loss function to enhance the retention of textual cues from the source images. Experimental results on test datasets demonstrate that the proposed method achieves superior performance in both qualitative and quantitative evaluations while preserving more critical medical information.
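To make the alignment step concrete, here is a minimal PyTorch sketch of a cross-attention block in which visual features query text embeddings and the result is injected back residually. The module name, dimensions, and single-head attention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAlignment(nn.Module):
    """Illustrative cross-attention block: image features attend to text features."""
    def __init__(self, img_dim=256, text_dim=768, dim=256):
        super().__init__()
        self.q = nn.Linear(img_dim, dim)   # queries from visual features
        self.k = nn.Linear(text_dim, dim)  # keys from text embeddings
        self.v = nn.Linear(text_dim, dim)  # values from text embeddings
        self.out = nn.Linear(dim, img_dim)

    def forward(self, img_feats, text_feats):
        # img_feats: (B, N_img, img_dim), text_feats: (B, N_txt, text_dim)
        q, k, v = self.q(img_feats), self.k(text_feats), self.v(text_feats)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        aligned = self.out(attn @ v)       # text-informed visual features
        return img_feats + aligned         # residual "text injection"

x = torch.randn(2, 196, 256)   # e.g. flattened CT/MRI feature patches
t = torch.randn(2, 32, 768)    # e.g. BiomedGPT report embeddings
print(SemanticAlignment()(x, t).shape)  # torch.Size([2, 196, 256])
```

The medical semantic loss described above could then, for example, compare embeddings of the report generated from the fused image against those of the source descriptions; its exact formulation is not reproduced here.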
Related papers
- Fuse4Seg: Image-Level Fusion Based Multi-Modality Medical Image Segmentation [13.497613339200184]
We argue the current feature-level fusion strategy is prone to semantic inconsistencies and misalignments.
We introduce a novel image-level fusion based multi-modality medical image segmentation method, Fuse4Seg.
The resultant fused image is a coherent representation that accurately amalgamates information from all modalities.
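The image-level-first pipeline argued for here can be summarized in a few lines; fuse_net and seg_net below are hypothetical stand-ins, not the Fuse4Seg architecture.

```python
import torch
import torch.nn as nn

fuse_net = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 1, 3, padding=1))      # stand-in fusion net
seg_net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 4, 1))                  # stand-in segmenter

mri, pet = torch.randn(1, 1, 128, 128), torch.randn(1, 1, 128, 128)

# Image-level fusion: fuse first, then segment the single coherent image,
# so the segmenter never sees misaligned intermediate feature maps.
fused = fuse_net(torch.cat([mri, pet], dim=1))
logits = seg_net(fused)                                       # (1, 4, 128, 128)
```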
arXiv Detail & Related papers (2024-09-16T14:39:04Z)
- Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG). MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z)
- Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
The Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
- MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning.
Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge.
Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
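For reference, the image-text contrastive objective underlying such frameworks is typically a symmetric InfoNCE loss in the CLIP style; the generic sketch below omits MLIP's divergence encoder and knowledge-guided terms, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of aligned pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(len(logits))            # matched pairs on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
```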
arXiv Detail & Related papers (2024-02-03T05:48:50Z)
- AliFuse: Aligning and Fusing Multi-modal Medical Data for Computer-Aided Diagnosis [1.64647940449869]
We propose a transformer-based framework, called Alifuse, for aligning and fusing multimodal medical data. We convert medical images and both unstructured and structured clinical records into vision and language tokens. We apply Alifuse to classify Alzheimer's disease, achieving state-of-the-art performance on five public datasets and outperforming eight baselines.
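One common way to realize such aligning and fusing is to concatenate vision and language tokens and pass them through a shared transformer encoder; the sketch below assumes this generic pattern rather than reproducing Alifuse.

```python
import torch
import torch.nn as nn

dim = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2)
cls_head = nn.Linear(dim, 2)                      # e.g. AD vs. control

img_tokens = torch.randn(4, 196, dim)             # patch embeddings of a scan
txt_tokens = torch.randn(4, 64, dim)              # embedded clinical-record tokens

# Joint encoding lets every image token attend to every record token and
# vice versa; pooling the fused sequence yields a multimodal prediction.
fused = encoder(torch.cat([img_tokens, txt_tokens], dim=1))
logits = cls_head(fused.mean(dim=1))              # (4, 2)
```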
arXiv Detail & Related papers (2024-01-02T07:28:21Z)
- Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest x-ray (CXR) images and associated patient demographic information.
The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
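A minimal version of this conditioning pattern, with hypothetical dimensions and a single demographic token appended to the visual memory, might look like the following; it is a sketch of the general idea, not the paper's network.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(nn.Conv2d(1, 32, 7, stride=4), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(7), nn.Flatten(2))    # (B, 32, 49)
demo_embed = nn.Linear(3, 32)                                  # e.g. age, sex, view
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=32, nhead=4, batch_first=True), 2)

cxr = torch.randn(2, 1, 224, 224)
demo = torch.randn(2, 3)

# Memory = visual tokens plus one demographic token; the decoder attends to
# both while generating report tokens.
memory = torch.cat([cnn(cxr).transpose(1, 2), demo_embed(demo)[:, None, :]], dim=1)
report_tokens = torch.randn(2, 20, 32)               # embedded partial report
out = decoder(report_tokens, memory)                 # (2, 20, 32)
```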
arXiv Detail & Related papers (2023-11-18T14:52:26Z)
- Multi-modal Medical Neurological Image Fusion using Wavelet Pooled Edge Preserving Autoencoder [3.3828292731430545]
This paper presents an end-to-end unsupervised fusion model for multimodal medical images based on an edge-preserving dense autoencoder network.
In the proposed model, feature extraction is improved by using wavelet decomposition-based attention pooling of feature maps.
The proposed model is trained on a variety of medical image pairs which helps in capturing the intensity distributions of the source images.
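Wavelet pooling can be pictured with a single-level Haar split in which the high-frequency (edge) bands gate the low-pass output; the normalization and gating below are illustrative assumptions, not the paper's exact scheme.

```python
import torch

def haar_pool(x):
    """Single-level 2D Haar split used as a downsampling step.

    Returns the low-pass band reweighted by an attention map derived from
    the high-frequency bands -- a rough stand-in for wavelet-decomposition-
    based attention pooling.
    """
    a, b = x[..., ::2, :], x[..., 1::2, :]               # row split
    low, high = (a + b) / 2, (a - b) / 2                 # row-wise Haar bands
    ll = (low[..., ::2] + low[..., 1::2]) / 2            # low-low band
    lh = (low[..., ::2] - low[..., 1::2]) / 2
    hl = (high[..., ::2] + high[..., 1::2]) / 2
    hh = (high[..., ::2] - high[..., 1::2]) / 2
    attn = torch.sigmoid(lh**2 + hl**2 + hh**2)          # emphasize edge-rich regions
    return ll * attn

x = torch.randn(1, 8, 64, 64)
print(haar_pool(x).shape)                                # torch.Size([1, 8, 32, 32])
```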
arXiv Detail & Related papers (2023-10-18T11:59:35Z)
- A New Multimodal Medical Image Fusion based on Laplacian Autoencoder with Channel Attention [3.1531360678320897]
Deep learning models have achieved end-to-end image fusion with highly robust and accurate performance.
Most DL-based fusion models perform down-sampling on the input images to minimize the number of learnable parameters and computations.
We propose a new multimodal medical image fusion model based on integrated Laplacian-Gaussian concatenation with attention pooling.
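A one-level Laplacian-Gaussian decomposition with band concatenation, kept at full resolution in the spirit of avoiding down-sampling losses, can be sketched as follows; the kernels and layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def laplacian_gaussian_pair(x):
    """One pyramid level: low-pass (Gaussian-like) plus Laplacian detail band."""
    gauss = F.avg_pool2d(x, 2)                           # crude low-pass + downsample
    up = F.interpolate(gauss, scale_factor=2,
                       mode='bilinear', align_corners=False)
    laplace = x - up                                     # high-frequency residual
    return torch.cat([up, laplace], dim=1)               # concatenated bands, full res

x = torch.randn(1, 1, 256, 256)
print(laplacian_gaussian_pair(x).shape)                  # torch.Size([1, 2, 256, 256])
```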
arXiv Detail & Related papers (2023-10-18T11:29:53Z)
- Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training [55.56609500764344]
We propose a unified framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework.
We also introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction.
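Paired masking with cross-modal reconstruction can be illustrated by masking image patches and letting a joint encoder over image and report tokens reconstruct them; the toy step below uses illustrative dimensions and is not the MPMA architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_patch, n_tok = 128, 64, 32
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), 2)
recon_head = nn.Linear(dim, dim)

img = torch.randn(2, n_patch, dim)                   # image patch embeddings
txt = torch.randn(2, n_tok, dim)                     # report token embeddings
mask = torch.rand(2, n_patch) < 0.5                  # mask ~50% of patches

masked_img = img.masked_fill(mask[..., None], 0.0)   # zero out masked patches
joint = encoder(torch.cat([masked_img, txt], dim=1)) # text assists reconstruction
pred = recon_head(joint[:, :n_patch])
loss = F.mse_loss(pred[mask], img[mask])             # score masked positions only
```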
arXiv Detail & Related papers (2023-05-13T13:53:48Z)
- An Attention-based Multi-Scale Feature Learning Network for Multimodal Medical Image Fusion [24.415389503712596]
Multimodal medical images provide physicians with rich, complementary patient information for diagnosis.
Image fusion techniques synthesize complementary information from multimodal images into a single image.
We introduce a novel Dilated Residual Attention Network for the medical image fusion task.
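A dilated residual attention block, in the generic sense the title suggests, combines parallel dilated convolutions (widening the receptive field without down-sampling) with a channel-attention gate on a residual path; the sketch below is an illustration under those assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class DilatedResidualAttention(nn.Module):
    """Parallel dilated convs merged by 1x1 conv, gated by channel attention."""
    def __init__(self, ch=32):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=d, dilation=d) for d in (1, 2, 4))
        self.merge = nn.Conv2d(3 * ch, ch, 1)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(ch, ch, 1), nn.Sigmoid())

    def forward(self, x):
        multi = torch.cat([b(x) for b in self.branches], dim=1)
        y = self.merge(multi)
        return x + y * self.gate(y)      # attention-weighted residual

x = torch.randn(1, 32, 64, 64)
print(DilatedResidualAttention()(x).shape)   # torch.Size([1, 32, 64, 64])
```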
arXiv Detail & Related papers (2022-12-09T04:19:43Z)
- Semantic segmentation of multispectral photoacoustic images using deep learning [53.65837038435433]
Photoacoustic imaging has the potential to revolutionise healthcare.
Clinical translation of the technology requires conversion of the high-dimensional acquired data into clinically relevant and interpretable information.
We present a deep learning-based approach to semantic segmentation of multispectral photoacoustic images.
arXiv Detail & Related papers (2021-05-20T09:33:55Z)