Related papers: LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?

LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?

URL: http://arxiv.org/abs/2503.07487v2
Date: Thu, 25 Sep 2025 15:17:42 GMT
Title: LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?
Authors: Bangyan Li, Wenxuan Huang, Zhenkun Gao, Yeqiang Wang, Yunhang Shen, Jingzhong Lin, Ling You, Yuxiang Shen, Shaohui Lin, Wanli Ouyang, Yuling Sun,
Abstract summary: We propose LLaVA-RadZ, a simple yet effective framework for zero-shot medical disease recognition via utilizing the existing MLLM features.<n>Specifically, we design an end-to-end training strategy, termed Decoding-Side Feature Alignment Training (DFAT) to take advantage of the characteristics of the MLLM decoder architecture.<n>We also introduce a Domain Knowledge Anchoring Module (DKAM) to exploit the intrinsic medical knowledge of large models.
Score: 59.81732629438753
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual understanding and reasoning across various vision-language tasks. However, we found that MLLMs cannot process effectively from fine-grained medical image data in the traditional Visual Question Answering (VQA) pipeline, as they do not exploit the captured features and available medical knowledge fully, results in MLLMs usually performing poorly in zero-shot medical disease recognition. Fortunately, this limitation does not indicate that MLLMs are fundamentally incapable of addressing fine-grained recognition tasks. From a feature representation perspective, MLLMs demonstrate considerable potential for tackling such challenging problems. Thus, to address this challenge, we propose LLaVA-RadZ, a simple yet effective framework for zero-shot medical disease recognition via utilizing the existing MLLM features. Specifically, we design an end-to-end training strategy, termed Decoding-Side Feature Alignment Training (DFAT) to take advantage of the characteristics of the MLLM decoder architecture and incorporate modality-specific tokens tailored for different modalities. Additionally, we introduce a Domain Knowledge Anchoring Module (DKAM) to exploit the intrinsic medical knowledge of large models, which mitigates the category semantic gap in image-text alignment. Extensive experiments demonstrate that our LLaVA-RadZ significantly outperforms traditional MLLMs in zero-shot disease recognition, achieving the comparable performance to the well-established and highly-optimized CLIP-based approaches.

Related papers

Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models [43.46006663176283]
This work systematically analyzes the impact of various perturbations on medical MLLMs.<n>For the visual modality, we propose a Perturbation-aware Denoising (PDC) which leverages MLLMs' own vision encoder to identify noise patterns.<n>For text denoising, we design a Self-instantiated Multi-agent System (SMS) that exploits the MLLMs' self-assessment capabilities to refine noisy text.
arXiv Detail & Related papers (2025-12-26T10:23:30Z)
Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models [58.91911788912665]
We propose Latent Visual Reconstruction (LaVer), a novel training framework that facilitates MLLMs in learning more discrimi visual representations.<n>Our method offers direct visual activation to MLLMs, which exhibit increased visual attention allocation, indicating enhanced utilization of visual information.
arXiv Detail & Related papers (2025-12-06T04:20:13Z)
GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification [4.922864692096282]
Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification.<n>Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge.<n>We propose a vision-language MIL framework with two key contributions.
arXiv Detail & Related papers (2025-08-02T09:59:39Z)
MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration [57.98393950821579]
We introduce the Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis (MAM)<n>Inspired by our empirical findings, MAM decomposes the medical diagnostic process into specialized roles: a General Practitioner, Specialist Team, Radiologist, Medical Assistant, and Director.<n>This modular and collaborative framework enables efficient knowledge updates and leverages existing medical LLMs and knowledge bases.
arXiv Detail & Related papers (2025-06-24T17:52:43Z)
Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain) [22.244699182222824]
Transformer-based language models, though not explicitly trained to mimic brain recordings, have demonstrated surprising alignment with brain activity.<n>Recently, a new class of instruction-tuned multimodal LLMs have emerged, showing remarkable zero-shot capabilities in open-ended multimodal vision tasks.<n>We investigate whether MLLMs, when prompted with natural instructions, lead to better brain alignment and effectively capture instruction-specific representations.
arXiv Detail & Related papers (2025-05-26T14:18:15Z)
Zeus: Zero-shot LLM Instruction for Union Segmentation in Multimodal Medical Imaging [4.341503087761129]
Conducting multimodal learning involves visual and text modalities shown as a solution, but collecting paired vision-language datasets is expensive and time-consuming. Inspired by the superior ability in numerous cross-modal tasks for Large Language Models (LLMs), we proposed a novel Vision-LLM union framework to address the issues.
arXiv Detail & Related papers (2025-04-09T23:33:35Z)
FunBench: Benchmarking Fundus Reading Skills of MLLMs [11.082273291462869]
Multimodal Large Language Models (MLLMs) have shown significant potential in medical image analysis.<n>Existing benchmarks lack fine-grained task divisions and fail to provide modular analysis of its two key modules, i.e., large language model (LLM) and vision encoder (VE)<n>This paper introduces FunBench, a novel visual question answering (VQA) benchmark designed to comprehensively evaluate MLLMs' fundus reading skills.
arXiv Detail & Related papers (2025-03-02T14:00:24Z)
Scaling Large Vision-Language Models for Enhanced Multimodal Comprehension In Biomedical Image Analysis [0.1984949535188529]
Vision language models (VLMs) address this by incorporating a pretrained vision backbone for processing images and a cross-modal projector. We developed intelligent assistants finetuned from LLaVA models to enhance multimodal understanding in low-dose radiation therapy.
arXiv Detail & Related papers (2025-01-26T02:48:01Z)
Mitigating Hallucinations of Large Language Models in Medical Information Extraction via Contrastive Decoding [92.32881381717594]
We introduce ALternate Contrastive Decoding (ALCD) to solve hallucination issues in medical information extraction tasks. ALCD demonstrates significant improvements in resolving hallucination issues compared to conventional decoding methods.
arXiv Detail & Related papers (2024-10-21T07:19:19Z)
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders [89.41055673919895]
This study explores the design space for MLLMs using a mixture of vision encoders and resolutions. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
arXiv Detail & Related papers (2024-08-28T17:59:31Z)
XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare [16.79952669254101]
We develop a novel method for zero-shot/few-shot in-context learning (ICL) using a multi-layered structured prompt. We also explore the efficacy of two communication styles between the user and Large Language Models (LLMs) Our study systematically evaluates the diagnostic accuracy and risk factors, including gender bias and false negative rates.
arXiv Detail & Related papers (2024-05-10T06:52:44Z)
RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning [14.366349078707263]
RJUA-MedDQA is a comprehensive benchmark in the field of medical specialization. This work introduces RJUA-MedDQA, a comprehensive benchmark in the field of medical specialization.
arXiv Detail & Related papers (2024-02-19T06:57:02Z)
Large Language Model Distilling Medication Recommendation Model [58.94186280631342]
We harness the powerful semantic comprehension and input-agnostic characteristics of Large Language Models (LLMs)<n>Our research aims to transform existing medication recommendation methodologies using LLMs.<n>To mitigate this, we have developed a feature-level knowledge distillation technique, which transfers the LLM's proficiency to a more compact model.
arXiv Detail & Related papers (2024-02-05T08:25:22Z)
MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning. Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge. Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
arXiv Detail & Related papers (2024-02-03T05:48:50Z)
Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy which aims to rectify the noisy predictions from vision models. By fine-tuning with the denoised labels, the learning model performance can be boosted in an unsupervised manner.
arXiv Detail & Related papers (2023-12-05T07:29:14Z)
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models [36.41816380074965]
We investigate the effectiveness of different vision encoders within Large Language Models (MLLMs) Our findings reveal that the shallow layer features of CLIP offer particular advantages for fine-grained tasks such as grounding and region understanding. We propose a simple yet effective feature merging strategy, named COMM, that integrates CLIP and DINO with Multi-level features Merging.
arXiv Detail & Related papers (2023-10-13T02:41:55Z)
Redefining Digital Health Interfaces with Large Language Models [69.02059202720073]
Large Language Models (LLMs) have emerged as general-purpose models with the ability to process complex information. We show how LLMs can provide a novel interface between clinicians and digital technologies. We develop a new prognostic tool using automated machine learning.
arXiv Detail & Related papers (2023-10-05T14:18:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.