Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs
- URL: http://arxiv.org/abs/2509.18015v1
- Date: Mon, 22 Sep 2025 16:54:23 GMT
- Title: Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs
- Authors: Advait Gosai, Arun Kavishwar, Stephanie L. McNamara, Soujanya Samineni, Renato Umeton, Alexander Chowdhury, William Lotter
- Abstract summary: We evaluate two general-purpose large language models (LLMs) and a domain-specific model (MedGemma) in their ability to localize pathologies on chest radiographs. GPT-5 exhibited a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all lower than a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). GPT-4 performed well on pathologies with fixed anatomical locations, but struggled with spatially variable findings and exhibited implausible predictions more frequently.
- Score: 33.80781505782195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has shown promising performance of frontier large language models (LLMs) and their multimodal counterparts in medical quizzes and diagnostic tasks, highlighting their potential for broad clinical utility given their accessible, general-purpose nature. However, beyond diagnosis, a fundamental aspect of medical image interpretation is the ability to localize pathological findings. Evaluating localization not only has clinical and educational relevance but also provides insight into a model's spatial understanding of anatomy and disease. Here, we systematically assess two general-purpose MLLMs (GPT-4 and GPT-5) and a domain-specific model (MedGemma) in their ability to localize pathologies on chest radiographs, using a prompting pipeline that overlays a spatial grid and elicits coordinate-based predictions. Averaged across nine pathologies in the CheXlocalize dataset, GPT-5 exhibited a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all lower than a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). Despite modest performance, error analysis revealed that GPT-5's predictions were largely in anatomically plausible regions, just not always precisely localized. GPT-4 performed well on pathologies with fixed anatomical locations, but struggled with spatially variable findings and exhibited anatomically implausible predictions more frequently. MedGemma demonstrated the lowest performance on all pathologies, showing limited capacity to generalize to this novel task. Our findings highlight both the promise and limitations of current MLLMs in medical imaging and underscore the importance of integrating them with task-specific tools for reliable use.
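The grid-based prompting pipeline described in the abstract can be sketched in pure Python. This is a minimal illustration, not the paper's exact protocol: the 4x4 grid resolution, the letter-number cell labels, and the point-in-annotated-cell scoring rule are all illustrative assumptions.

```python
def cell_of(x, y, width, height, rows=4, cols=4):
    """Map a pixel coordinate to its grid-cell label, e.g. 'B3'.

    Rows are lettered top to bottom (A, B, ...), columns numbered
    left to right (1, 2, ...). Coordinates on the far edge clamp
    into the last cell.
    """
    row = min(int(y * rows / height), rows - 1)
    col = min(int(x * cols / width), cols - 1)
    return f"{chr(ord('A') + row)}{col + 1}"

def localization_accuracy(predictions, ground_truth):
    """Fraction of cases whose predicted cell falls inside the
    set of cells covered by the ground-truth annotation."""
    hits = sum(1 for case, cell in predictions.items()
               if cell in ground_truth.get(case, set()))
    return hits / len(predictions) if predictions else 0.0

# Hypothetical model outputs, already parsed into grid cells.
preds = {"case1": cell_of(300, 100, 512, 512),   # -> 'A3'
         "case2": cell_of(60, 450, 512, 512)}    # -> 'D1'
# Hypothetical ground truth: cells overlapped by the annotation.
truth = {"case1": {"A3", "B3"}, "case2": {"C1"}}
print(localization_accuracy(preds, truth))       # 0.5
```

In this scheme the model is prompted with the grid-overlaid radiograph and asked to answer in cell labels; scoring then reduces to set membership, which is one simple way to operationalize "localization accuracy" for coordinate-based predictions.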
Related papers
- Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary [36.736436091313585]
This commentary is the first controlled, cross-sectional evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against its predecessor GPT-4o. GPT-5 demonstrated substantial gains in expert-level textual reasoning, with absolute improvements exceeding 25 percentage points on MedXpertQA. When tasked with multimodal synthesis, GPT-5 effectively leveraged this enhanced reasoning capacity to ground uncertain clinical narratives in concrete imaging evidence.
arXiv Detail & Related papers (2026-03-05T03:24:48Z) - DermoGPT: Open Weights and Open Data for Morphology-Grounded Dermatological Reasoning MLLMs [54.8829900010621]
Multimodal Large Language Models (MLLMs) show promise for medical applications, yet progress in dermatology lags due to limited training data, narrow task coverage, and lack of clinically-grounded supervision. We present a comprehensive framework to address these gaps. First, we introduce DermoInstruct, a large-scale morphology-anchored instruction corpus comprising 211,243 images and 772,675 trajectories across five task formats. Second, we establish DermoBench, a rigorous benchmark evaluating 11 tasks across four clinical axes: Morphology, Diagnosis, Reasoning, and Fairness, including a challenging subset of 3,600
arXiv Detail & Related papers (2026-01-05T07:55:36Z) - MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images [0.0]
This study presents a comparison between two fundamentally different AI architectures: the specialized open-source agent MedGemma and the proprietary large multimodal model GPT-4. The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability by achieving a mean test accuracy of 80.37%. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical implementation, positioning MedGemma as a sophisticated tool for complex, evidence-based medical reasoning.
arXiv Detail & Related papers (2025-12-29T08:48:36Z) - A DeepSeek-Powered AI System for Automated Chest Radiograph Interpretation in Clinical Practice [83.11942224668127]
Janus-Pro-CXR (1B) is a chest X-ray interpretation system based on the DeepSeek Janus-Pro model. Our system outperforms state-of-the-art X-ray report generation models in automated report generation.
arXiv Detail & Related papers (2025-12-23T13:26:13Z) - XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography [6.447908430647854]
We present the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays. We generate visual explanations using cross-attention and similarity-based localization maps, and quantitatively assess their alignment with radiologist-annotated regions across multiple pathologies.
arXiv Detail & Related papers (2025-10-22T13:52:19Z) - Boosting Pathology Foundation Models via Few-shot Prompt-tuning for Rare Cancer Subtyping [80.92960114162746]
We propose PathPT, a novel framework that exploits the potential of vision-language pathology foundation models. PathPT converts WSI-level supervision into fine-grained tile-level guidance by leveraging the zero-shot capabilities of VL models. Results show that PathPT consistently delivers superior performance, achieving substantial gains in subtyping accuracy and cancerous region grounding ability.
arXiv Detail & Related papers (2025-08-21T18:04:41Z) - PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology [33.51485504161335]
We present PathBench, the first comprehensive benchmark for pathology foundation models (PFMs). Our framework incorporates large-scale data, enabling objective comparison of PFMs. We have collected 15,888 WSIs from 8,549 patients across 10 hospitals, encompassing over 64 diagnosis and prognosis tasks.
arXiv Detail & Related papers (2025-05-26T16:42:22Z) - Anatomy-Guided Radiology Report Generation with Pathology-Aware Regional Prompts [3.1019279528120363]
Radiology reporting generative AI holds significant potential to alleviate clinical workloads and streamline medical care.
Existing systems often fall short due to their reliance on fixed-size, patch-level image features and insufficient incorporation of pathological information.
We propose an innovative approach that leverages pathology-aware regional prompts to explicitly integrate anatomical and pathological information of various scales.
arXiv Detail & Related papers (2024-11-16T12:36:20Z) - Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision.
This study exhaustively evaluated Gemini, GPT-4, and four other popular large models across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z) - How Well Do Multi-modal LLMs Interpret CT Scans? An Auto-Evaluation Framework for Analyses [14.884877292068351]
This study introduces a novel evaluation framework, named GPTRadScore.
It assesses the capabilities of multi-modal LLMs, such as GPT-4 with Vision (GPT-4V), Gemini Pro Vision, LLaVA-Med, and RadFM, in generating descriptions for prospectively-identified findings.
By employing a decomposition technique based on GPT-4, GPTRadScore compares these generated descriptions with gold-standard report sentences, analyzing their accuracy in terms of body part, location, and type of finding.
arXiv Detail & Related papers (2024-03-08T21:16:28Z) - A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis [87.25494411021066]
This work evaluates GPT-4V's multimodal capability for medical image analysis.
It finds that GPT-4V excels at understanding medical images and generates high-quality radiology reports.
However, its performance on medical visual grounding needs to be substantially improved.
arXiv Detail & Related papers (2023-10-31T11:39:09Z) - ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset utilized in this study encompasses a remarkable total of 332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, notably ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.