AliFuse: Aligning and Fusing Multi-modal Medical Data for Computer-Aided Diagnosis
- URL: http://arxiv.org/abs/2401.01074v2
- Date: Sun, 7 Jan 2024 04:14:16 GMT
- Title: AliFuse: Aligning and Fusing Multi-modal Medical Data for Computer-Aided Diagnosis
- Authors: Qiuhui Chen, Yi Hong
- Abstract summary: We propose a transformer-based framework, called Alifuse, for aligning and fusing multi-modal medical data.
We apply Alifuse to classify Alzheimer's disease and obtain state-of-the-art performance on five public datasets.
- Score: 1.9450973046619378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical data collected for making a diagnostic decision are typically
multi-modal and provide complementary perspectives of a subject. A
computer-aided diagnosis system benefits from multi-modal inputs; however,
effectively fusing such multi-modal data remains a challenging task that has
attracted considerable attention in medical research. In this paper, we propose
a transformer-based framework, called Alifuse, for aligning and fusing
multi-modal medical data. Specifically, we convert images and both unstructured
and structured text into vision and language tokens, and use intramodal and
intermodal attention mechanisms to learn holistic representations of all
imaging and non-imaging data for classification. We apply Alifuse to classify
Alzheimer's disease and obtain state-of-the-art performance on five public
datasets, outperforming eight baselines. The source code will be released
online.
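As a rough illustration of the kind of architecture the abstract describes, the PyTorch sketch below encodes vision and language tokens with intramodal self-attention, fuses them with intermodal cross-attention, and classifies the pooled result. It is a minimal sketch under assumed dimensions, module choices, and fusion order, not the authors' released implementation (the paper's code is not yet available).

```python
# Minimal sketch of intramodal + intermodal attention fusion for classification.
# All names, dimensions, and the fusion order are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionFusionClassifier(nn.Module):
    def __init__(self, dim=256, num_heads=8, num_classes=3):
        super().__init__()
        # Intramodal encoders: self-attention within each modality.
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True), num_layers=2)
        self.language_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True), num_layers=2)
        # Intermodal fusion: language tokens attend to vision tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cls_head = nn.Linear(dim, num_classes)

    def forward(self, vision_tokens, language_tokens):
        # vision_tokens:   (B, Nv, dim) image patch embeddings
        # language_tokens: (B, Nl, dim) embeddings of structured + unstructured text
        v = self.vision_encoder(vision_tokens)       # intramodal attention (imaging)
        t = self.language_encoder(language_tokens)   # intramodal attention (text)
        fused, _ = self.cross_attn(query=t, key=v, value=v)  # intermodal attention
        return self.cls_head(fused.mean(dim=1))      # pooled tokens -> diagnosis logits


# Example usage with random token embeddings.
model = AttentionFusionClassifier()
logits = model(torch.randn(2, 196, 256), torch.randn(2, 64, 256))
print(logits.shape)  # torch.Size([2, 3])
```

Mean pooling and a linear head stand in for whatever classification head the paper actually uses; only the intramodal/intermodal attention split is taken from the abstract.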
Related papers
- ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z)
- MOSMOS: Multi-organ segmentation facilitated by medical report supervision [10.396987980136602]
We propose a novel pre-training & fine-tuning framework for Multi-Organ Supervision (MOS).
Specifically, we first introduce global contrastive learning to align medical image-report pairs in the pre-training stage.
To remedy the discrepancy, we further leverage multi-label recognition to implicitly learn the semantic correspondence between image pixels and organ tags.
arXiv Detail & Related papers (2024-09-04T03:46:17Z)
- Automated Ensemble Multimodal Machine Learning for Healthcare [52.500923923797835]
We introduce a multimodal framework, AutoPrognosis-M, that enables the integration of structured clinical (tabular) data and medical imaging using automated machine learning.
AutoPrognosis-M incorporates 17 imaging models, including convolutional neural networks and vision transformers, and three distinct multimodal fusion strategies.
arXiv Detail & Related papers (2024-07-25T17:46:38Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- HyperFusion: A Hypernetwork Approach to Multimodal Integration of Tabular and Medical Imaging Data for Predictive Modeling [4.44283662576491]
We present a novel framework based on hypernetworks to fuse clinical imaging and tabular data by conditioning the image processing on the EHR's values and measurements.
We show that our framework outperforms both single-modality models and state-of-the-art MRI-tabular data fusion methods.
arXiv Detail & Related papers (2024-03-20T05:50:04Z)
- Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
The Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We evaluate on downstream image classification and image-text retrieval tasks across four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
- Building RadiologyNET: Unsupervised annotation of a large-scale multimodal medical database [0.4915744683251151]
The use of machine learning in medical diagnosis and treatment has grown significantly in recent years.
However, the availability of large annotated image datasets remains a major obstacle since the process of annotation is time-consuming and costly.
This paper explores how to automatically annotate a database of medical radiology images with regard to their semantic similarity.
arXiv Detail & Related papers (2023-07-27T13:00:33Z)
- Heterogeneous Graph Learning for Multi-modal Medical Data Analysis [6.3082663934391014]
We propose an effective graph-based framework called HetMed for fusing multi-modal medical data.
HetMed captures the complex relationship between patients in a systematic way, which leads to more accurate clinical decisions.
arXiv Detail & Related papers (2022-11-28T09:14:36Z)
- AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation [50.21065317817769]
We propose an AlignTransformer framework, which includes the Align Hierarchical Attention (AHA) and the Multi-Grained Transformer (MGT) modules.
Experiments on the public IU-Xray and MIMIC-CXR datasets show that the AlignTransformer can achieve results competitive with state-of-the-art methods on the two datasets.
arXiv Detail & Related papers (2022-03-18T13:43:53Z)
- Cross-Modal Information Maximization for Medical Imaging: CMIM [62.28852442561818]
In hospitals, data are siloed to specific information systems that make the same information available under different modalities.
This offers a unique opportunity to obtain and use, at training time, multiple views of the same information that might not always be available at test time.
We propose an innovative framework that makes the most of the available data by learning good representations of a multi-modal input that are resilient to modality dropping at test time.
arXiv Detail & Related papers (2020-10-20T20:05:35Z)
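The robustness idea in the CMIM entry above can be pictured with a simple training trick: randomly masking one modality during training so the shared representation still works when a view is missing at test time. The sketch below is a generic, assumption-laden illustration of that idea (placeholder encoders, dimensions, and additive fusion); it does not implement CMIM's actual cross-modal information-maximization objective.

```python
# Minimal sketch of training with random modality dropout for test-time
# resilience to missing modalities. Encoders and dimensions are assumptions,
# not the CMIM method itself.
import torch
import torch.nn as nn


class DropResilientEncoder(nn.Module):
    """Fuses two modalities; one may be randomly masked during training."""

    def __init__(self, dim_img=128, dim_tab=64, dim_z=32, p_drop=0.3):
        super().__init__()
        self.enc_img = nn.Linear(dim_img, dim_z)  # placeholder imaging encoder
        self.enc_tab = nn.Linear(dim_tab, dim_z)  # placeholder tabular/report encoder
        self.p_drop = p_drop

    def forward(self, x_img, x_tab):
        z_img, z_tab = self.enc_img(x_img), self.enc_tab(x_tab)
        if self.training:
            # Drop at most one modality per sample so the fused code is never empty.
            drop_img = torch.rand(x_img.size(0), 1, device=x_img.device) < self.p_drop
            drop_tab = (torch.rand(x_tab.size(0), 1, device=x_tab.device) < self.p_drop) & ~drop_img
            z_img = torch.where(drop_img, torch.zeros_like(z_img), z_img)
            z_tab = torch.where(drop_tab, torch.zeros_like(z_tab), z_tab)
        return z_img + z_tab  # shared code, usable when one view is missing at test time


# Example usage with random features for a batch of four subjects.
enc = DropResilientEncoder()
z = enc(torch.randn(4, 128), torch.randn(4, 64))
print(z.shape)  # torch.Size([4, 32])
```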
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.