Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer's Disease
- URL: http://arxiv.org/abs/2509.07613v3
- Date: Wed, 15 Oct 2025 05:06:34 GMT
- Title: Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer's Disease
- Authors: Fangqi Cheng, Surajit Ray, Xiaochen Yang
- Abstract summary: Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering. Most existing models are typically trained from scratch or fine-tuned on large-scale 2D image-text pairs. We propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs for 3D MRI.
- Score: 3.46857682956989
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering, but they still face several limitations. Most notably, they underutilize patient metadata and lack integration of clinical diagnostic knowledge. Moreover, most existing models are typically trained from scratch or fine-tuned on large-scale 2D image-text pairs, requiring extensive computational resources, and their effectiveness on 3D medical imaging is often limited due to the absence of structural information. To address these gaps, we propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs for 3D MRI and demonstrate its application in Alzheimer's disease (AD) diagnosis. Our system introduces two key innovations. First, we convert structured metadata into synthetic reports, enriching textual input for improved image-text alignment. Second, we add an auxiliary token trained to predict the mini-mental state examination (MMSE) score, a widely used clinical measure of cognitive function that correlates with AD severity. This provides additional supervision for fine-tuning. Applying lightweight prompt tuning to both image and text modalities, our approach achieves state-of-the-art performance on ADNI with only 1,504 training MRIs, outperforming methods trained on 27,161 MRIs, and shows strong zero-shot generalization on OASIS-2 and AIBL. Code is available at https://github.com/CFQ666312/DEFT-VLM-AD.
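To make the two innovations concrete, here is a minimal, hypothetical PyTorch sketch of (a) turning structured metadata into a synthetic report sentence and (b) an auxiliary learnable token whose output regresses the MMSE score. The field names, the `MMSETokenHead` class, and the template wording are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

def metadata_to_report(meta: dict) -> str:
    """Template-based synthetic report from structured metadata (illustrative fields)."""
    return (f"{meta['age']}-year-old {meta['sex']} patient with "
            f"{meta['education']} years of education; "
            f"APOE4 allele count: {meta['apoe4']}.")

class MMSETokenHead(nn.Module):
    """Appends a learnable token to the visual token sequence; a linear head on
    that token's output regresses the MMSE score (range 0-30)."""
    def __init__(self, dim: int):
        super().__init__()
        self.mmse_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.regressor = nn.Linear(dim, 1)

    def append(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -> (batch, seq_len + 1, dim)
        return torch.cat([tokens, self.mmse_token.expand(tokens.size(0), -1, -1)], dim=1)

    def predict(self, encoded: torch.Tensor) -> torch.Tensor:
        # read the MMSE prediction off the appended (last) token
        return self.regressor(encoded[:, -1]).squeeze(-1)
```

During fine-tuning, the squared error between the token's prediction and the true MMSE score would be added to the image-text alignment loss as the extra supervision the abstract describes.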
Related papers
- Multimodal Visual Surrogate Compression for Alzheimer's Disease Classification [69.87877580725768]
Multimodal Visual Surrogate Compression (MVSC) learns to compress and adapt large 3D sMRI volumes into compact 2D features. MVSC has two key components: a Volume Context module that captures global cross-slice context under textual guidance, and an Adaptive Slice Fusion module that aggregates slice-level information in a text-enhanced, patch-wise manner.
arXiv Detail & Related papers (2026-01-29T13:05:46Z)
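As a rough illustration of the text-guided slice aggregation the MVSC entry describes, the following hypothetical module lets a text embedding attend over per-slice features; the module name, shapes, and head count are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TextGuidedSliceFusion(nn.Module):
    """Fuses per-slice features into one compact representation, using a text
    embedding as the attention query (hypothetical module, shapes assumed)."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, slice_feats: torch.Tensor, text_emb: torch.Tensor):
        # slice_feats: (batch, n_slices, dim); text_emb: (batch, dim)
        query = text_emb.unsqueeze(1)                     # (batch, 1, dim)
        fused, weights = self.attn(query, slice_feats, slice_feats)
        return fused.squeeze(1), weights                  # (batch, dim), per-slice attention
```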
- MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities [89.81463562506637]
We introduce MedROV, the first real-time open-vocabulary detection model for medical imaging. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures.
arXiv Detail & Related papers (2025-11-25T18:59:53Z)
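Open-vocabulary detection of the kind described above can be sketched as scoring region (box) features against free-text class-name embeddings; this is a generic hedged illustration, not MedROV's actual code, and the function name and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def open_vocab_scores(region_feats: torch.Tensor,
                      class_text_embs: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Classify detected regions by cosine similarity to text embeddings of
    class names; novel classes are just extra rows in class_text_embs."""
    r = F.normalize(region_feats, dim=-1)      # (n_boxes, dim)
    t = F.normalize(class_text_embs, dim=-1)   # (n_classes, dim)
    return (r @ t.T / temperature).softmax(dim=-1)
```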
- DinoAtten3D: Slice-Level Attention Aggregation of DinoV2 for 3D Brain MRI Anomaly Classification [2.731729370870452]
Anomaly detection and classification in medical imaging are critical for early diagnosis but remain challenging due to limited annotated data, class imbalance, and the high cost of expert labeling. We propose an attention-based global aggregation framework tailored specifically for 3D medical image anomaly classification.
arXiv Detail & Related papers (2025-09-15T23:31:40Z)
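A hedged sketch of slice-level attention aggregation in the spirit of the DinoAtten3D entry: per-slice 2D embeddings (e.g., one DINOv2 CLS vector per axial slice) are pooled with gated attention into a single volume-level representation. Names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class SliceAttentionPool(nn.Module):
    """Gated attention pooling over per-slice embeddings (e.g. one frozen
    DINOv2 CLS vector per axial slice) for volume-level classification."""
    def __init__(self, dim: int, hidden: int = 256, n_classes: int = 2):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, slice_embs: torch.Tensor):
        # slice_embs: (batch, n_slices, dim)
        w = self.score(slice_embs).softmax(dim=1)     # per-slice attention weights
        volume_emb = (w * slice_embs).sum(dim=1)      # weighted average -> (batch, dim)
        return self.classifier(volume_emb), w
```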
- M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision [24.846428105192405]
We train M3Ret, a unified visual encoder, without any modality-specific customization. It successfully learns transferable representations using both generative (MAE) and contrastive (SimDINO) self-supervised learning (SSL) paradigms. Our approach sets a new state-of-the-art in zero-shot image-to-image retrieval across all individual modalities, surpassing strong baselines such as DINOv3 and the text-supervised BMC-CLIP.
arXiv Detail & Related papers (2025-09-01T10:59:39Z)
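Zero-shot image-to-image retrieval of the kind evaluated above reduces, at inference time, to nearest-neighbour search in the shared embedding space. A minimal sketch, assuming embeddings have already been extracted:

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb: torch.Tensor, gallery_embs: torch.Tensor, k: int = 5):
    """Return the k most similar gallery images (by cosine similarity)."""
    q = F.normalize(query_emb, dim=-1)       # (dim,)
    g = F.normalize(gallery_embs, dim=-1)    # (n_gallery, dim)
    scores, indices = (g @ q).topk(k)
    return scores, indices
```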
- VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine [11.993301266706139]
We propose a vision-language pre-training framework, termed VELVET-Med, specifically designed for limited volumetric data such as 3D CT and associated radiology reports. Our approach seeks to uncover rich spatial and semantic relationships embedded in volumetric medical images and corresponding clinical narratives. The resulting encoders exhibit strong transferability, achieving state-of-the-art performance across a wide range of downstream tasks.
arXiv Detail & Related papers (2025-08-16T17:08:43Z)
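Vision-language pre-training of this kind is typically driven by a symmetric contrastive (CLIP-style InfoNCE) objective over matched volume/report pairs. The loss below is a generic sketch under that assumption, not VELVET-Med's exact objective.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_embs: torch.Tensor, txt_embs: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: the i-th volume and i-th report in a batch are positives."""
    i = F.normalize(img_embs, dim=-1)
    t = F.normalize(txt_embs, dim=-1)
    logits = i @ t.T / temperature                     # (batch, batch)
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```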
- MG-3D: Multi-Grained Knowledge-Enhanced 3D Medical Vision-Language Pre-training [7.968487067774351]
3D medical image analysis is pivotal in numerous clinical applications. However, large-scale vision-language pre-training remains underexplored in 3D medical image analysis. We propose MG-3D, pre-trained on large-scale data (47.1K).
arXiv Detail & Related papers (2024-12-08T09:45:59Z)
- CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates that it can identify organs and abnormalities in a zero-shot manner using natural language.
arXiv Detail & Related papers (2024-04-23T17:59:01Z)
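Zero-shot organ or abnormality identification with such a model usually amounts to matching an organ-level visual embedding against text-prompt embeddings produced by the text tower. A minimal sketch, with all names hypothetical:

```python
import torch
import torch.nn.functional as F

def zero_shot_identify(organ_emb: torch.Tensor,
                       prompt_embs: torch.Tensor,
                       prompt_names: list[str]) -> str:
    """Name an organ-level visual embedding by its closest text prompt,
    e.g. prompts like "a CT scan showing the liver" encoded beforehand."""
    sims = F.normalize(organ_emb, dim=-1) @ F.normalize(prompt_embs, dim=-1).T
    return prompt_names[sims.argmax().item()]
```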
- Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images [68.42215385041114]
This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection.
Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels.
Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models.
arXiv Detail & Related papers (2024-03-19T09:28:19Z)
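The residual adapters mentioned above can be sketched as small bottleneck modules added after frozen encoder stages; the class below is a generic example (the bottleneck width and scaling factor are assumptions, not values from the paper).

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Lightweight bottleneck adapter; the residual connection preserves the
    frozen pre-trained features at initialization."""
    def __init__(self, dim: int, bottleneck: int = 64, scale: float = 0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.scale * self.up(self.act(self.down(x)))
```

Attaching one such adapter after each stage of the frozen visual encoder gives the stepwise, multi-level feature enhancement the entry describes.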
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
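Since text-to-image retrieval is the benchmark in the entry above, a standard recall@k evaluation is the natural metric. A minimal sketch, assuming text i is paired with image i:

```python
import torch
import torch.nn.functional as F

def recall_at_k(text_embs: torch.Tensor, image_embs: torch.Tensor, k: int = 10) -> float:
    """Fraction of texts whose paired image appears in the top-k retrieved images."""
    sims = F.normalize(text_embs, dim=-1) @ F.normalize(image_embs, dim=-1).T
    topk = sims.topk(k, dim=-1).indices                        # (n, k)
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    return (topk == targets).any(dim=-1).float().mean().item()
```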
- Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation [116.87918100031153]
We propose a Cross-modal clinical Graph Transformer (CGT) for ophthalmic report generation (ORG).
CGT injects clinical relation triples into the visual features as prior knowledge to drive the decoding procedure.
Experiments on the large-scale FFA-IR benchmark demonstrate that the proposed CGT is able to outperform previous benchmark methods.
arXiv Detail & Related papers (2022-06-04T13:16:30Z)
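One simple way to inject relation triples as prior knowledge, in the spirit of the CGT entry above, is to embed each (subject, relation, object) triple and prepend it to the visual sequence that the report decoder attends over. This is a speculative sketch, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TripleInjector(nn.Module):
    """Embeds (subject, relation, object) triples and prepends them to the
    visual sequence, so a report decoder attends over both sources."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, visual_feats: torch.Tensor, triples: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, n_patches, dim); triples: (batch, n_triples, 3) token ids
        triple_emb = self.embed(triples).mean(dim=2)   # (batch, n_triples, dim)
        return torch.cat([triple_emb, visual_feats], dim=1)
```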