Related papers: MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images

URL: http://arxiv.org/abs/2512.23304v1
Date: Mon, 29 Dec 2025 08:48:36 GMT
Title: MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images
Authors: Md. Sazzadul Islam Prottasha, Nabil Walid Rafi,
Abstract summary: This study presents a comparison between two fundamentally different AI architectures: the specialized open-source agent MedGemma and the proprietary large multimodal model GPT-4. The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability by achieving a mean test accuracy of 80.37%. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical implementation, positioning MedGemma as a sophisticated tool for complex, evidence-based medical reasoning.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (LLMs) introduce an emerging paradigm for medical imaging by interpreting scans through the lens of extensive clinical knowledge, offering a transformative approach to disease classification. This study presents a critical comparison between two fundamentally different AI architectures: the specialized open-source agent MedGemma and the proprietary large multimodal model GPT-4 for diagnosing six different diseases. The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability by achieving a mean test accuracy of 80.37% compared to 69.58% for the untuned GPT-4. Furthermore, MedGemma exhibited notably higher sensitivity in high-stakes clinical tasks, such as cancer and pneumonia detection. Quantitative analysis via confusion matrices and classification reports provides comprehensive insights into model performance across all categories. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical implementation, positioning MedGemma as a sophisticated tool for complex, evidence-based medical reasoning.
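The abstract reports mean test accuracy and per-task sensitivity derived from confusion matrices. As a minimal sketch of how those two quantities are read off a confusion matrix, the Python below assumes rows are true labels and columns are predicted labels. The function names, class labels, and counts are hypothetical illustrations and are not taken from the paper.

```python
from typing import Dict, List


def accuracy(cm: List[List[int]]) -> float:
    """Overall accuracy: correct predictions (the diagonal) over all samples."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def sensitivity_per_class(cm: List[List[int]], labels: List[str]) -> Dict[str, float]:
    """Sensitivity (recall) per class: diagonal entry over its row sum.

    Assumes rows index the true class and columns the predicted class.
    """
    result = {}
    for i, label in enumerate(labels):
        row_total = sum(cm[i])
        result[label] = cm[i][i] / row_total if row_total else 0.0
    return result


# Hypothetical 3-class counts, made up for illustration only.
labels = ["normal", "pneumonia", "cancer"]
cm = [
    [40, 5, 5],   # true normal
    [4, 44, 2],   # true pneumonia
    [6, 4, 40],   # true cancer
]

if __name__ == "__main__":
    print(f"accuracy: {accuracy(cm):.4f}")
    for name, value in sensitivity_per_class(cm, labels).items():
        print(f"sensitivity[{name}]: {value:.4f}")
```

Sensitivity is reported per class because a high overall accuracy can hide weak detection of a single high-stakes class, which is the distinction the abstract draws for cancer and pneumonia.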

Related papers

Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary [36.736436091313585]
This commentary is the first controlled, cross-sectional evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against its predecessor GPT-4o. GPT-5 demonstrated substantial gains in expert-level textual reasoning, with absolute improvements exceeding 25 percentage-points on MedXpertQA. When tasked with multimodal synthesis, GPT-5 effectively leveraged this enhanced reasoning capacity to ground uncertain clinical narratives in concrete imaging evidence.
arXiv Detail & Related papers (2026-03-05T03:24:48Z)
Specializing Foundation Models via Mixture of Low-Rank Experts for Comprehensive Head CT Analysis [6.04562866374803]
We propose a Mixture of Low-Rank Experts (MoLRE) framework that extends LoRA with multiple specialized low-rank adapters and unsupervised soft routing. We present a benchmark of MoLRE across six state-of-the-art medical imaging foundation models spanning 2D and 3D architectures, general-domain, medical-domain, and head CT-specific pretraining, and model sizes ranging from 7M to 431M parameters.
arXiv Detail & Related papers (2026-02-28T14:32:38Z)
A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis [82.01597026329158]
We introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS) for pathology-specific text-to-image synthesis. CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations.
arXiv Detail & Related papers (2025-12-15T10:22:43Z)
Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs [33.80781505782195]
We evaluate two general-purpose large language models (LLMs) and a domain-specific model (MedGemma) in their ability to localize pathologies on chest radiographs. GPT-5 exhibited a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all lower than a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). GPT-4 performed well on pathologies with fixed anatomical locations, but struggled with spatially variable findings and exhibited implausible predictions more frequently.
arXiv Detail & Related papers (2025-09-22T16:54:23Z)
Boosting Pathology Foundation Models via Few-shot Prompt-tuning for Rare Cancer Subtyping [80.92960114162746]
We propose PathPT, a novel framework that exploits the potential of vision-language pathology foundation models. PathPT converts WSI-level supervision into fine-grained tile-level guidance by leveraging the zero-shot capabilities of VL models. Results show that PathPT consistently delivers superior performance, achieving substantial gains in subtyping accuracy and cancerous region grounding ability.
arXiv Detail & Related papers (2025-08-21T18:04:41Z)
Performance of GPT-5 in Brain Tumor MRI Reasoning [4.156123728258067]
Large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. We evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a curated brain tumor VQA benchmark. Results showed that GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%).
arXiv Detail & Related papers (2025-08-14T17:35:31Z)
Benchmarking and Explaining Deep Learning Cortical Lesion MRI Segmentation in Multiple Sclerosis [28.192924379673862]
Cortical lesions (CLs) have emerged as valuable biomarkers in multiple sclerosis (MS). We propose a comprehensive benchmark of CL detection and segmentation in MRI. We rely on the self-configuring nnU-Net framework, designed for medical imaging segmentation, and propose adaptations tailored to the improved CL detection.
arXiv Detail & Related papers (2025-07-16T09:56:11Z)
Towards a Multimodal MRI-Based Foundation Model for Multi-Level Feature Exploration in Segmentation, Molecular Subtyping, and Grading of Glioma [0.2796197251957244]
Multi-Task S-UNETR (MTSUNET) model is a novel foundation-based framework built on the BrainSegFounder model. It simultaneously performs glioma segmentation, histological subtyping and neuroimaging subtyping. It shows significant potential for advancing noninvasive, personalized glioma management by improving predictive accuracy and interpretability.
arXiv Detail & Related papers (2025-03-10T01:27:09Z)
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses. We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
arXiv Detail & Related papers (2024-10-16T23:03:27Z)
Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision. This study evaluated the performance of the Gemini, GPT-4, and 4 popular large models for an exhaustive evaluation across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z)
Holistic Evaluation of GPT-4V for Biomedical Imaging [113.46226609088194]
GPT-4V represents a breakthrough in artificial general intelligence for computer vision. We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more. Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization.
arXiv Detail & Related papers (2023-11-10T18:40:44Z)
Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis [59.35504779947686]
GPT-4V is OpenAI's newest model for multimodal medical diagnosis. Our evaluation encompasses 17 human body systems. GPT-4V demonstrates proficiency in distinguishing between medical image modalities and anatomy. It faces significant challenges in disease diagnosis and generating comprehensive reports.
arXiv Detail & Related papers (2023-10-15T18:32:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.