Robust Pre-Training of Medical Vision-and-Language Models with Domain-Invariant Multi-Modal Masked Reconstruction
- URL: http://arxiv.org/abs/2602.17689v1
- Date: Fri, 06 Feb 2026 01:20:56 GMT
- Authors: Melika Filvantorkaman, Mohsen Piri
- Abstract summary: We propose Robust Multi-Modal Masked Reconstruction (Robust-MMR), a self-supervised pre-training framework that explicitly incorporates robustness objectives into masked vision-language learning. We evaluate Robust-MMR on multiple medical vision-language benchmarks, including medical visual question answering (VQA-RAD), cross-domain image-text classification (MELINDA), and robust image-caption retrieval (ROCO). Our results show that explicitly modeling robustness during pre-training leads to more reliable and transferable medical vision-language representations for real-world deployment.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical vision-language models show strong potential for joint reasoning over medical images and clinical text, but their performance often degrades under domain shift caused by variations in imaging devices, acquisition protocols, and reporting styles. Existing multi-modal pre-training methods largely overlook robustness, treating it as a downstream adaptation problem. In this work, we propose Robust Multi-Modal Masked Reconstruction (Robust-MMR), a self-supervised pre-training framework that explicitly incorporates robustness objectives into masked vision-language learning. Robust-MMR integrates asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints to encourage domain-invariant representations. We evaluate Robust-MMR on multiple medical vision-language benchmarks, including medical visual question answering (VQA-RAD, SLAKE, VQA-2019), cross-domain image-text classification (MELINDA), and robust image-caption retrieval (ROCO). Robust-MMR achieves 78.9% cross-domain accuracy on VQA-RAD, outperforming the strongest baseline by 3.8 percentage points, and reaches 74.6% and 77.0% accuracy on SLAKE and VQA-2019, respectively. Under perturbed evaluation, Robust-MMR improves VQA-RAD accuracy from 69.1% to 75.6%. For image-text classification, cross-domain MELINDA accuracy increases from 70.3% to 75.2%, while retrieval experiments show a reduction in mean rank degradation from over 16 to 4.1 under perturbation. Qualitative results further demonstrate improved clinical reasoning for disease detection and structural abnormality assessment. These findings show that explicitly modeling robustness during pre-training leads to more reliable and transferable medical vision-language representations for real-world deployment.
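The abstract names three robustness components (asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints) without spelling out their formulas, so the sketch below is only a minimal illustration of the general shape such an objective could take: a standard masked-reconstruction loss plus a consistency term between clean and perturbed views. The mask ratio, the loss weight `lam`, and the `encoder`/`decoder` interfaces are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of a robustness-aware masked
# reconstruction objective: reconstruct masked patches from a clean view and
# penalize divergence between clean and perturbed latent representations.
import torch
import torch.nn.functional as F

def random_patch_mask(patches: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
    """Boolean mask over patch tokens; True marks a masked token."""
    batch, num_patches, _ = patches.shape
    scores = torch.rand(batch, num_patches, device=patches.device)
    idx = scores.topk(int(num_patches * ratio), dim=1).indices
    mask = torch.zeros(batch, num_patches, dtype=torch.bool, device=patches.device)
    mask.scatter_(1, idx, True)
    return mask

def robustness_aware_loss(encoder, decoder, patches, perturbed_patches, lam=0.1):
    """Masked reconstruction plus a consistency penalty (a stand-in for the
    paper's domain-consistency regularization)."""
    mask = random_patch_mask(patches)
    latent_clean = encoder(patches, mask)           # encode the clean view
    recon = decoder(latent_clean, mask)             # predict the masked tokens
    rec_loss = F.mse_loss(recon[mask], patches[mask])

    latent_pert = encoder(perturbed_patches, mask)  # same mask, perturbed view
    cons_loss = F.mse_loss(latent_pert, latent_clean.detach())
    return rec_loss + lam * cons_loss
```

Detaching the clean embedding makes the consistency term one-directional, so the perturbed view is pulled toward the clean one rather than both drifting; that choice, like everything else in the sketch, is a plausible reading of "domain-consistency regularization" rather than the paper's actual design.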
Related papers
- A DeepSeek-Powered AI System for Automated Chest Radiograph Interpretation in Clinical Practice [83.11942224668127]
Janus-Pro-CXR (1B) is a chest X-ray interpretation system based on the DeepSeek Janus-Pro model. Our system outperforms state-of-the-art X-ray report generation models in automated report generation.
arXiv Detail & Related papers (2025-12-23T13:26:13Z)
- SAMM2D: Scale-Aware Multi-Modal 2D Dual-Encoder for High-Sensitivity Intracranial Aneurysm Screening [0.0]
We introduce SAMM2D, a dual-encoder framework that achieves an AUC of 0.686 on the RSNA intracranial aneurysm dataset. Our results suggest that future medical imaging could benefit more from strong pretraining than from increasingly complex augmentation pipelines.
arXiv Detail & Related papers (2025-12-20T01:44:30Z)
- Uncertainty-Aware Domain Adaptation for Vitiligo Segmentation in Clinical Photographs [4.19421520851419]
Accurately quantifying vitiligo extent in routine clinical photographs is crucial for longitudinal monitoring of treatment response. We propose a data-efficient training strategy combining domain-adaptive pre-training on the ISIC 2019 dataset with an ROI-based dual-task loss to suppress background noise. Our framework demonstrates high reliability with zero catastrophic failures and provides interpretable entropy maps to identify ambiguous regions for clinician review.
arXiv Detail & Related papers (2025-12-12T18:56:21Z)
- Mask What Matters: Controllable Text-Guided Masking for Self-Supervised Medical Image Analysis [2.6554246520306624]
Mask What Matters is a controllable text-guided masking framework for self-supervised medical image analysis. It consistently outperforms existing masked image modeling (MIM) methods, achieving gains of up to +3.1 percentage points in classification accuracy, while using substantially lower overall masking ratios.
arXiv Detail & Related papers (2025-09-27T02:26:56Z)
- EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models [82.43729208063468]
Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy -- models' tendency to uncritically echo user-provided information. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs.
arXiv Detail & Related papers (2025-09-24T14:09:55Z)
- Image Quality Assessment for Machines: Paradigm, Large-scale Database, and Models [60.356842878501254]
Machine vision systems (MVS) are intrinsically vulnerable to performance degradation under adverse visual conditions. We propose a machine-centric image quality assessment (MIQA) framework that quantifies the impact of image degradations on MVS performance.
arXiv Detail & Related papers (2025-08-27T13:07:24Z)
- Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare. A red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses in four safety-critical domains. A suite of adversarial agents is applied to autonomously mutate test cases, identify and evolve unsafe-triggering strategies, and evaluate responses. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z)
- On the Robustness of Medical Vision-Language Models: Are they Truly Generalizable? [0.9626666671366837]
We introduce MediMeta-C, a corruption benchmark that applies several perturbations across multiple medical imaging datasets. We propose RobustMedCLIP, a visual-encoder adaptation of a pretrained MVLM that incorporates few-shot tuning to enhance resilience against corruptions.
arXiv Detail & Related papers (2025-05-21T12:08:31Z)
- Metrics that matter: Evaluating image quality metrics for medical image generation [48.85783422900129]
This study comprehensively assesses commonly used no-reference image quality metrics using brain MRI data. We evaluate metric sensitivity to a range of challenges, including noise, distribution shifts, and, critically, morphological alterations designed to mimic clinically relevant inaccuracies.
arXiv Detail & Related papers (2025-05-12T01:57:25Z)
- Robust and Generalisable Segmentation of Subtle Epilepsy-causing Lesions: a Graph Convolutional Approach [1.180462901068842]
Focal cortical dysplasia (FCD) is a leading cause of drug-resistant epilepsy that can be cured by surgery, but the lesions are subtle and easily missed. "Ground truth" manual lesion masks are therefore expensive, limited, and subject to large inter-rater variability.
We propose to approach the problem as semantic segmentation using graph convolutional networks (GCN), which allows our model to learn spatial relationships between brain regions.
arXiv Detail & Related papers (2023-06-02T08:56:56Z)
- Automated SSIM Regression for Detection and Quantification of Motion Artefacts in Brain MR Images [54.739076152240024]
Motion artefacts in magnetic resonance brain images are a serious problem. Assessing MR image quality is essential before proceeding with clinical diagnosis. An automated image quality assessment based on structural similarity index (SSIM) regression is proposed here.
arXiv Detail & Related papers (2022-06-14T10:16:54Z)
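Since the last entry builds on SSIM regression, a quick reference for how SSIM itself is computed may help; the snippet below is a generic illustration using scikit-image, not the paper's pipeline, and the Gaussian blur is just a crude stand-in for real motion artefacts.

```python
# Generic SSIM scoring example (not the paper's pipeline): compare a clean
# image against a blurred stand-in for a motion-corrupted MR slice.
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
clean = rng.random((128, 128))               # placeholder for a brain MR slice
corrupted = gaussian_filter(clean, sigma=2)  # crude proxy for motion artefacts

# data_range must be given for float images; SSIM is 1.0 for identical inputs.
score = structural_similarity(clean, corrupted, data_range=1.0)
print(f"SSIM: {score:.3f}")  # lower values indicate stronger degradation
```

A regression model like the one the entry describes would be trained to predict such scores directly from the corrupted image, without access to the clean reference.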