Does medical specialization of VLMs enhance discriminative power?: A comprehensive investigation through feature distribution analysis
- URL: http://arxiv.org/abs/2601.14774v1
- Date: Wed, 21 Jan 2026 08:53:40 GMT
- Title: Does medical specialization of VLMs enhance discriminative power?: A comprehensive investigation through feature distribution analysis
- Authors: Keita Takeda, Tomoya Sakai
- Abstract summary: This study investigates the feature representations produced by publicly available, open-source medical vision-language models (VLMs). Our experiments showed that medical VLMs can extract discriminative features that are effective for medical classification tasks. Our results imply that enhancing the text encoder is more crucial than training intensively on medical images when developing medical VLMs.
- Score: 2.243145970857166
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study investigates the feature representations produced by publicly available, open-source medical vision-language models (VLMs). While medical VLMs are expected to capture diagnostically relevant features, their learned representations remain underexplored, and standard evaluations such as classification accuracy do not fully reveal whether they acquire truly discriminative, lesion-specific features. Understanding these representations is crucial for revealing medical image structures and improving downstream tasks in medical image analysis. This study aims to investigate the feature distributions learned by medical VLMs and to evaluate the impact of medical specialization. We analyze the feature distributions that several representative medical VLMs extract from lesion classification datasets spanning multiple image modalities, and compare these distributions with those of non-medical VLMs to assess the effect of domain-specific medical training. Our experiments showed that medical VLMs can extract discriminative features that are effective for medical classification tasks. Moreover, non-medical VLMs that incorporate recent contextual-enrichment techniques, such as LLM2CLIP, were found to produce more refined feature representations. Our results imply that enhancing the text encoder is more crucial than training intensively on medical images when developing medical VLMs. Notably, non-medical models are particularly vulnerable to biases introduced by text strings overlaid on images. These findings underscore the need for careful model selection according to the downstream task, as well as awareness of inference-time risks arising from background biases such as textual information in images.
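To make the kind of analysis the abstract describes concrete, here is a minimal sketch of a feature-distribution probe: extract image embeddings with a general-domain CLIP model, score class separability of the resulting features, and burn a text string into each image to test for overlay bias. The model name, the silhouette metric, and the `overlay_text` helper are illustrative assumptions, not the authors' exact protocol.

```python
# Minimal sketch of a feature-distribution probe for a CLIP-style VLM.
# Assumptions: model choice, silhouette metric, and overlay helper are
# illustrative; the paper's exact protocol may differ.
import torch
from PIL import ImageDraw
from sklearn.metrics import silhouette_score
from transformers import CLIPModel, CLIPProcessor

MODEL = "openai/clip-vit-base-patch32"  # swap in a medical VLM to compare
model = CLIPModel.from_pretrained(MODEL).eval()
processor = CLIPProcessor.from_pretrained(MODEL)

@torch.no_grad()
def embed(images):
    """Return L2-normalized image embeddings for a list of PIL images."""
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def overlay_text(image, text="LESION"):
    """Burn a text string onto a copy of the image (overlay-bias probe)."""
    image = image.copy()
    ImageDraw.Draw(image).text((5, 5), text, fill="white")
    return image

# images/labels come from a lesion-classification dataset of your choice:
# clean = embed(images)
# noisy = embed([overlay_text(im) for im in images])
# print("separability (clean):  ", silhouette_score(clean.numpy(), labels))
# print("separability (overlay):", silhouette_score(noisy.numpy(), labels))
```

A drop in the separability score after the overlay would indicate the text-string bias the abstract warns about; running the same probe with a medical VLM in place of the general-domain model would reproduce the paper's medical-vs-non-medical comparison in miniature.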
Related papers
- Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation. MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z) - Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images [8.797134639962982]
We evaluate the ability of state-of-the-art Vision-Language Models (VLMs) to accurately determine relative positions in medical images. We investigate whether visual prompts, such as alphanumeric or colored markers placed on anatomical structures, can enhance performance. Our evaluations suggest that, in medical imaging, VLMs rely more on prior anatomical knowledge than on actual image content.
arXiv Detail & Related papers (2025-08-01T11:44:06Z) - Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models [3.3091869879941687]
We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding. We reformulate each task into instruction-based prompts suitable for vision-language reasoning. Results show that multi-task training improves robustness and accuracy.
arXiv Detail & Related papers (2025-05-22T13:18:44Z) - RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models [17.579521693647383]
We introduce RetinalGPT, a multimodal conversational assistant for clinically preferred quantitative analysis of retinal images. In particular, RetinalGPT outperforms generic-domain MLLMs by a large margin in the diagnosis of retinal diseases.
arXiv Detail & Related papers (2025-03-06T00:19:54Z) - Improving Medical Large Vision-Language Models with Abnormal-Aware Feedback [57.98393950821579]
We propose UMed-LVLM, a novel model designed to unveil medical abnormalities. We propose a prompting method that uses GPT-4V to generate diagnoses based on identified abnormal areas in medical images. Our UMed-LVLM significantly outperforms existing Med-LVLMs in identifying and understanding medical abnormalities.
arXiv Detail & Related papers (2025-01-02T17:37:20Z) - A Survey of Medical Vision-and-Language Applications and Their Techniques [48.268198631277315]
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data.
Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied.
We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics.
arXiv Detail & Related papers (2024-11-19T03:27:05Z) - Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG). MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z) - MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder [26.830574964308962]
We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis.
We explore masked autoencoders (MAEs) for zero-shot learning across domains, which enhances the model's ability to learn from limited data.
Lastly, we validate that using language improves zero-shot performance for medical image analysis.
arXiv Detail & Related papers (2024-03-07T16:11:43Z) - Optimizing Skin Lesion Classification via Multimodal Data and Auxiliary Task Integration [54.76511683427566]
This research introduces a novel multimodal method for classifying skin lesions, integrating smartphone-captured images with essential clinical and demographic information.
A distinctive aspect of this method is the integration of an auxiliary task focused on super-resolution image prediction.
Experimental evaluations were conducted on the PAD-UFES20 dataset using various deep-learning architectures.
arXiv Detail & Related papers (2024-02-16T05:16:20Z) - OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM [48.16696073640864]
We introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark.
All images in this benchmark are sourced from authentic medical scenarios.
We have found that existing LVLMs struggle to address these medical VQA problems effectively.
arXiv Detail & Related papers (2024-02-14T13:51:56Z) - FeaInfNet: Diagnosis in Medical Image with Feature-Driven Inference and Visual Explanations [4.022446255159328]
Interpretable deep learning models have received widespread attention in the field of image recognition.
However, many proposed interpretability models still suffer from insufficient accuracy and interpretability in medical image disease diagnosis.
We propose feature-driven inference network (FeaInfNet) to solve these problems.
arXiv Detail & Related papers (2023-12-04T13:09:00Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)