Evaluating the Diagnostic Classification Ability of Multimodal Large Language Models: Insights from the Osteoarthritis Initiative
- URL: http://arxiv.org/abs/2601.02443v1
- Date: Mon, 05 Jan 2026 13:31:44 GMT
- Title: Evaluating the Diagnostic Classification Ability of Multimodal Large Language Models: Insights from the Osteoarthritis Initiative
- Authors: Li Wang, Xi Chen, XiangWen Deng, HuaHui Yi, ZeKun Jiang, Kang Li, Jian Li
- Abstract summary: Multimodal large language models (MLLMs) show promising performance on medical visual question answering (VQA) and report generation. We evaluated MLLM architectures on knee osteoarthritis (OA) radiograph classification.
- Score: 14.002322217782364
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal large language models (MLLMs) show promising performance on medical visual question answering (VQA) and report generation, but these generation and explanation abilities do not reliably transfer to disease-specific classification. We evaluated MLLM architectures on knee osteoarthritis (OA) radiograph classification, which remains underrepresented in existing medical MLLM benchmarks, even though knee OA affects an estimated 300 to 400 million people worldwide. Through systematic ablation studies manipulating the vision encoder, the connector, and the large language model (LLM) across diverse training strategies, we measured each component's contribution to diagnostic accuracy. In our classification task, a trained vision encoder alone could outperform full MLLM pipelines in classification accuracy, and fine-tuning the LLM provided no meaningful improvement over prompt-based guidance. Moreover, LoRA fine-tuning on a small, class-balanced dataset (500 images) gave better results than training on a much larger but class-imbalanced set (5,778 images), indicating that data balance and quality can matter more than raw scale for this task. These findings suggest that for domain-specific medical classification, LLMs are more effective as interpreters and report generators than as primary classifiers. The MLLM architecture therefore appears less suitable for medical image diagnostic classification tasks that demand high certainty. We recommend prioritizing vision encoder optimization and careful dataset curation when developing clinically applicable systems.
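For intuition, the following is a minimal sketch of the "trained vision encoder plus linear classification head" baseline that the abstract reports can outperform a full MLLM pipeline. The CLIP-style ViT encoder (openai/clip-vit-base-patch32), the 5-class Kellgren-Lawrence (KL) grade label set, and the hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch (not the authors' implementation): fine-tune only the
# vision encoder of an MLLM with a plain linear head for KL-grade
# classification, dropping the connector and LLM entirely.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class KneeOAClassifier(nn.Module):
    def __init__(self, num_classes: int = 5,
                 encoder_name: str = "openai/clip-vit-base-patch32"):
        super().__init__()
        # Vision tower borrowed from a CLIP-style MLLM (assumed encoder).
        self.encoder = CLIPVisionModel.from_pretrained(encoder_name)
        # A linear head stands in for the connector + LLM stack.
        self.head = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(pixel_values=pixel_values).pooler_output
        return self.head(feats)

# Training-step sketch: per the abstract's finding, this would be trained on a
# small, class-balanced subset (e.g., ~100 radiographs per KL grade) rather
# than a large class-imbalanced set.
model = KneeOAClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()

def train_step(pixel_values: torch.Tensor, labels: torch.Tensor) -> float:
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(pixel_values), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```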
Related papers
- Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework [0.0]
We present a retrieval-augmented generation (RAG) based medical QA system that combines domain-specific knowledge retrieval with open-source LLMs to answer medical questions. We fine-tune two state-of-the-art open LLMs (LLaMA2 and Falcon) using Low-Rank Adaptation (LoRA) for efficient domain specialization. Our fine-tuned LLaMA2 model achieves 71.8% accuracy on PubMedQA, substantially improving over the 55.4% zero-shot baseline.
arXiv Detail & Related papers (2025-12-05T16:38:47Z) - Applications of Small Language Models in Medical Imaging Classification with a Focus on Prompt Strategies [9.1953139634128]
This study investigates the performance of small language models (SLMs) in a medical imaging classification task. Using the NIH Chest X-ray dataset, we evaluate multiple SLMs on the task of classifying chest X-ray positions. Our results show that certain SLMs achieve competitive accuracy with well-crafted prompts.
arXiv Detail & Related papers (2025-08-18T21:48:45Z) - EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow [43.82288530883818]
EH-Benchmark is a novel ophthalmology benchmark designed to evaluate hallucinations in Medical Large Language Models. We categorize hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition. Our framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability.
arXiv Detail & Related papers (2025-07-24T12:07:36Z) - Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models [3.3091869879941687]
We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding. We reformulate each task into instruction-based prompts suitable for vision-language reasoning. Results show that multi-task training improves robustness and accuracy.
arXiv Detail & Related papers (2025-05-22T13:18:44Z) - LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition? [59.81732629438753]
We propose LLaVA-RadZ, a simple yet effective framework for zero-shot medical disease recognition that utilizes existing MLLM features. Specifically, we design an end-to-end training strategy, termed Decoding-Side Feature Alignment Training (DFAT), to take advantage of the characteristics of the MLLM decoder architecture. We also introduce a Domain Knowledge Anchoring Module (DKAM) to exploit the intrinsic medical knowledge of large models.
arXiv Detail & Related papers (2025-03-10T16:05:40Z) - MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. However, Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses. We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
arXiv Detail & Related papers (2024-10-16T23:03:27Z) - ExGra-Med: Extended Context Graph Alignment for Medical Vision-Language Models [95.47808515575382]
ExGra-Med is a novel framework for vision-language integration in medical AI. It aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. It matches LLaVA-Med's performance using just 10% of the pre-training data, achieving a 20.13% gain on VQA-RAD and approaching full-data performance.
arXiv Detail & Related papers (2024-10-03T15:52:03Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark to date, with a well-categorized data structure and multiple levels of perceptual granularity.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable-sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - SSLM: Self-Supervised Learning for Medical Diagnosis from MR Video [19.5917119072985]
In this paper, we propose a self-supervised learning approach to learn spatial anatomical representations from magnetic resonance (MR) video clips.
The proposed pretext model learns meaningful spatial context-invariant representations.
Experiments show that the features learned by the pretext model provide explainable performance in the downstream task.
arXiv Detail & Related papers (2021-04-21T12:01:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.