Related papers: Unified Multi-modal Diagnostic Framework with Reconstruction Pre-training and Heterogeneity-combat Tuning

Unified Multi-modal Diagnostic Framework with Reconstruction Pre-training and Heterogeneity-combat Tuning

URL: http://arxiv.org/abs/2404.06057v1
Date: Tue, 9 Apr 2024 06:47:44 GMT
Title: Unified Multi-modal Diagnostic Framework with Reconstruction Pre-training and Heterogeneity-combat Tuning
Authors: Yupei Zhang, Li Pan, Qiushi Yang, Tan Li, Zhen Chen,
Abstract summary: We propose a Unified Medical Multi-modal Diagnostic (UMD) framework with tailored pre-training and downstream tuning strategies. Specifically, we propose the Multi-level Reconstruction Pre-training (MR-Pretrain) strategy, which guides models to capture the semantic information from masked inputs of different modalities. In particular, TD-Calib fine-tunes the pre-trained model regarding the distribution of downstream datasets, and GM-Coord adjusts the gradient weights according to the dynamic optimization status of different modalities.
Score: 14.556686415877602
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Medical multi-modal pre-training has revealed promise in computer-aided diagnosis by leveraging large-scale unlabeled datasets. However, existing methods based on masked autoencoders mainly rely on data-level reconstruction tasks, but lack high-level semantic information. Furthermore, two significant heterogeneity challenges hinder the transfer of pre-trained knowledge to downstream tasks, \textit{i.e.}, the distribution heterogeneity between pre-training data and downstream data, and the modality heterogeneity within downstream data. To address these challenges, we propose a Unified Medical Multi-modal Diagnostic (UMD) framework with tailored pre-training and downstream tuning strategies. Specifically, to enhance the representation abilities of vision and language encoders, we propose the Multi-level Reconstruction Pre-training (MR-Pretrain) strategy, including a feature-level and data-level reconstruction, which guides models to capture the semantic information from masked inputs of different modalities. Moreover, to tackle two kinds of heterogeneities during the downstream tuning, we present the heterogeneity-combat downstream tuning strategy, which consists of a Task-oriented Distribution Calibration (TD-Calib) and a Gradient-guided Modality Coordination (GM-Coord). In particular, TD-Calib fine-tunes the pre-trained model regarding the distribution of downstream datasets, and GM-Coord adjusts the gradient weights according to the dynamic optimization status of different modalities. Extensive experiments on five public medical datasets demonstrate the effectiveness of our UMD framework, which remarkably outperforms existing approaches on three kinds of downstream tasks.

Related papers

Learning from Heterogeneous Structural MRI via Collaborative Domain Adaptation for Late-Life Depression Assessment [24.340328016766183]
We propose a Collaborative Domain Adaptation framework for LLD detection using T1-weighted MRIs.<n>The framework consists of three stages: supervised training on labeled source data, self-supervised target feature adaptation and collaborative training on unlabeled target data.<n>Experiments conducted on multi-site T1-weighted MRI data demonstrate that the framework consistently outperforms state-of-the-art unsupervised domain adaptation methods.
arXiv Detail & Related papers (2025-07-30T01:38:32Z)
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models.<n>MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
Towards Robust Multimodal Physiological Foundation Models: Handling Arbitrary Missing Modalities [9.785262633953794]
Physio Omni is a foundation model for multimodal physiological signal analysis. It trains a decoupled multimodal tokenizer, enabling masked signal pre-training. It achieves state-of-the-art performance while maintaining strong robustness to missing modalities.
arXiv Detail & Related papers (2025-04-28T09:00:04Z)
Incomplete Modality Disentangled Representation for Ophthalmic Disease Grading and Diagnosis [16.95583564875497]
We propose an Incomplete Modality Disentangled Representation (IMDR) strategy to disentangle features into explicit independent modal-common and modal-specific features. Experiments on four multimodal datasets demonstrate that the proposed IMDR outperforms the state-of-the-art methods significantly.
arXiv Detail & Related papers (2025-02-17T12:10:35Z)
Continually Evolved Multimodal Foundation Models for Cancer Prognosis [50.43145292874533]
Cancer prognosis is a critical task that involves predicting patient outcomes and survival rates. Previous studies have integrated diverse data modalities, such as clinical notes, medical images, and genomic data, leveraging their complementary information. Existing approaches face two major limitations. First, they struggle to incorporate newly arrived data with varying distributions into training, such as patient records from different hospitals. Second, most multimodal integration methods rely on simplistic concatenation or task-specific pipelines, which fail to capture the complex interdependencies across modalities.
arXiv Detail & Related papers (2025-01-30T06:49:57Z)
AMM-Diff: Adaptive Multi-Modality Diffusion Network for Missing Modality Imputation [2.8498944632323755]
In clinical practice, full imaging is not always feasible, often due to complex acquisition protocols, stringent privacy regulations, or specific clinical needs. A promising solution is missing data imputation, where absent modalities are generated from available ones. We propose an Adaptive Multi-Modality Diffusion Network (AMM-Diff), a novel diffusion-based generative model capable of handling any number of input modalities and generating the missing ones.
arXiv Detail & Related papers (2025-01-22T12:29:33Z)
HG-Adapter: Improving Pre-Trained Heterogeneous Graph Neural Networks with Dual Adapters [53.97380482341493]
"pre-train, prompt-tuning" has demonstrated impressive performance for tuning pre-trained heterogeneous graph neural networks (HGNNs) We propose a unified framework that combines two new adapters with potential labeled data extension to improve the generalization of pre-trained HGNN models.
arXiv Detail & Related papers (2024-11-02T06:43:54Z)
Repurposing Foundation Model for Generalizable Medical Time Series Classification [16.21546283978257]
FORMED is a framework for repurposing a backbone foundation model to enable highly generalizable MedTS classification on unseen datasets.<n>We evaluate FORMED on 5 diverse MedTS datasets, benchmarking against 11 Task-Specific Models (TSM) and 4 Task-Specific Adaptation (TSA) methods.<n>Our results demonstrate FORMED's dominant performance, achieving up to 35% absolute improvement in F1-score (on ADFTD dataset) over specialized baselines.
arXiv Detail & Related papers (2024-10-03T23:50:04Z)
Multi-OCT-SelfNet: Integrating Self-Supervised Learning with Multi-Source Data Fusion for Enhanced Multi-Class Retinal Disease Classification [2.5091334993691206]
Development of a robust deep-learning model for retinal disease diagnosis requires a substantial dataset for training. The capacity to generalize effectively on smaller datasets remains a persistent challenge. We've combined a wide range of data sources to improve performance and generalization to new data.
arXiv Detail & Related papers (2024-09-17T17:22:35Z)
PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation [51.509573838103854]
We propose a semi-supervised learning framework, termed Progressive Mean Teachers (PMT), for medical image segmentation. Our PMT generates high-fidelity pseudo labels by learning robust and diverse features in the training process. Experimental results on two datasets with different modalities, i.e., CT and MRI, demonstrate that our method outperforms the state-of-the-art medical image segmentation approaches.
arXiv Detail & Related papers (2024-09-08T15:02:25Z)
MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications. Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders. We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
Multi-Modal Federated Learning for Cancer Staging over Non-IID Datasets with Unbalanced Modalities [9.476402318365446]
In this work, we introduce a novel FL architecture designed to accommodate not only the heterogeneity of data samples, but also the inherent heterogeneity/non-uniformity of data modalities across institutions. We propose a solution by devising a distributed gradient blending and proximity-aware client weighting strategy tailored for multi-modal FL.
arXiv Detail & Related papers (2024-01-07T23:45:01Z)
Masked Contrastive Reconstruction for Cross-modal Medical Image-Report Retrieval [3.5314225883644945]
Cross-modal medical image-report retrieval task plays a significant role in clinical diagnosis and various medical generative tasks. We propose an efficient framework named Masked Contrastive and Reconstruction (MCR), which takes masked data as the sole input for both tasks. This enhances task connections, reducing information interference and competition between them, while also substantially decreasing the required GPU memory and training time.
arXiv Detail & Related papers (2023-12-26T01:14:10Z)
Dynamic Multimodal Information Bottleneck for Multimodality Classification [26.65073424377933]
We propose a dynamic multimodal information bottleneck framework for attaining a robust fused feature representation. Specifically, our information bottleneck module serves to filter out the task-irrelevant information and noises in the fused feature. Our method surpasses the state-of-the-art and is significantly more robust, being the only method to remain performance when large-scale noisy channels exist.
arXiv Detail & Related papers (2023-11-02T08:34:08Z)
ArSDM: Colonoscopy Images Synthesis with Adaptive Refinement Semantic Diffusion Models [69.9178140563928]
Colonoscopy analysis is essential for assisting clinical diagnosis and treatment. The scarcity of annotated data limits the effectiveness and generalization of existing methods. We propose an Adaptive Refinement Semantic Diffusion Model (ArSDM) to generate colonoscopy images that benefit the downstream tasks.
arXiv Detail & Related papers (2023-09-03T07:55:46Z)
Lung Cancer Risk Estimation with Incomplete Data: A Joint Missing Imputation Perspective [5.64530854079352]
We address imputation of missing data by modeling the joint distribution of multi-modal data. Motivated by partial bidirectional generative adversarial net (PBiGAN), we propose a new Conditional PBiGAN (C-PBiGAN) method. C-PBiGAN achieves significant improvements in lung cancer risk estimation compared with representative imputation methods.
arXiv Detail & Related papers (2021-07-25T20:15:16Z)
G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers. We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z)
MS-Net: Multi-Site Network for Improving Prostate Segmentation with Heterogeneous MRI Data [75.73881040581767]
We propose a novel multi-site network (MS-Net) for improving prostate segmentation by learning robust representations. Our MS-Net improves the performance across all datasets consistently, and outperforms state-of-the-art methods for multi-site learning.
arXiv Detail & Related papers (2020-02-09T14:11:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.