Is ChatGPT-5 Ready for Mammogram VQA?
- URL: http://arxiv.org/abs/2508.11628v1
- Date: Fri, 15 Aug 2025 17:56:24 GMT
- Title: Is ChatGPT-5 Ready for Mammogram VQA?
- Authors: Qiang Li, Shansong Wang, Mingzhe Hu, Mojtaba Safari, Zachary Eidex, Xiaofeng Yang
- Abstract summary: GPT-5 was consistently the best-performing model but lagged behind both human experts and domain-specific fine-tuned models. While GPT-5 exhibits promising capabilities for screening tasks, its performance remains insufficient for high-stakes clinical imaging applications.
- Score: 4.156123728258067
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Mammogram visual question answering (VQA) integrates image interpretation with clinical reasoning and has the potential to support breast cancer screening. We systematically evaluated the GPT-5 family and the GPT-4o model on four public mammography datasets (EMBED, InBreast, CMMD, CBIS-DDSM) for BI-RADS assessment, abnormality detection, and malignancy classification tasks. GPT-5 was consistently the best-performing model but lagged behind both human experts and domain-specific fine-tuned models. On EMBED, GPT-5 achieved the highest scores among GPT variants in density (56.8%), distortion (52.5%), mass (64.5%), calcification (63.5%), and malignancy (52.8%) classification. On InBreast, it attained 36.9% BI-RADS accuracy, 45.9% abnormality detection, and 35.0% malignancy classification. On CMMD, GPT-5 reached 32.3% abnormality detection and 55.0% malignancy accuracy. On CBIS-DDSM, it achieved 69.3% BI-RADS accuracy, 66.0% abnormality detection, and 58.2% malignancy accuracy. Compared with human expert estimates, GPT-5 exhibited lower sensitivity (63.5%) and specificity (52.3%). While GPT-5 exhibits promising capabilities for screening tasks, its performance remains insufficient for high-stakes clinical imaging applications without targeted domain adaptation and optimization. However, the substantial improvement in performance from GPT-4o to GPT-5 shows a promising trend in the potential for general large language models (LLMs) to assist with mammography VQA tasks.
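The abstract summarizes screening performance with sensitivity and specificity for malignancy classification. As a minimal illustration of how those two figures are derived from binary predictions (hypothetical toy labels below, not the paper's data):

```python
def sensitivity_specificity(y_true, y_pred):
    # Binary labels: 1 = malignant (positive), 0 = benign (negative).
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true-positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true-negative rate
    return sensitivity, specificity

# Hypothetical toy case: 4 malignant and 4 benign exams
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(sensitivity_specificity(y_true, y_pred))  # (0.5, 0.75)
```

On this definition, a screening model at GPT-5's reported 63.5% sensitivity would miss roughly a third of malignant cases, which is why the authors stress expert oversight.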
Related papers
- Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary [36.736436091313585]
This commentary is the first controlled, cross-sectional evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against its predecessor GPT-4o. GPT-5 demonstrated substantial gains in expert-level textual reasoning, with absolute improvements exceeding 25 percentage points on MedXpertQA. When tasked with multimodal synthesis, GPT-5 effectively leveraged this enhanced reasoning capacity to ground uncertain clinical narratives in concrete imaging evidence.
arXiv Detail & Related papers (2026-03-05T03:24:48Z)
- PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology [48.732366302949515]
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
arXiv Detail & Related papers (2026-03-02T00:50:39Z)
- A DeepSeek-Powered AI System for Automated Chest Radiograph Interpretation in Clinical Practice [83.11942224668127]
Janus-Pro-CXR (1B) is a chest X-ray interpretation system based on the DeepSeek Janus-Pro model. Our system outperforms state-of-the-art X-ray report generation models in automated report generation.
arXiv Detail & Related papers (2025-12-23T13:26:13Z)
- Benchmarking GPT-5 in Radiation Oncology: Measurable Gains, but Persistent Need for Expert Oversight [1.0471566053937098]
GPT-5 is a large language model that has been specifically marketed towards oncology use. On the TXIT benchmark, GPT-5 achieved a mean accuracy of 92.8%, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%). In the vignette evaluation, GPT-5's treatment recommendations were rated highly for correctness (mean 3.24/4, 95% CI: 3.11-3.38) and comprehensiveness (3.59/4, 95% CI: 3.49-3.69). While hallucinations were infrequent, the presence of substantive errors underscores that GPT-5-generated recommendations require rigorous expert oversight before clinical implementation.
arXiv Detail & Related papers (2025-08-29T16:55:25Z)
- Benchmarking GPT-5 for Zero-Shot Multimodal Medical Reasoning in Radiology and Radiation Oncology [4.156123728258067]
We present a zero-shot evaluation of GPT-5 and its smaller variants (GPT-5-mini, GPT-5-nano) against GPT-4o across three representative tasks. Across all datasets, GPT-5 achieved the highest accuracy, with substantial gains over GPT-4o of up to +200% in challenging anatomical regions. GPT-5 delivers consistent and often pronounced performance improvements over GPT-4o in both image-grounded reasoning and domain-specific numerical problem-solving.
arXiv Detail & Related papers (2025-08-15T16:14:51Z)
- Performance of GPT-5 in Brain Tumor MRI Reasoning [4.156123728258067]
Large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. We evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a curated brain tumor VQA benchmark. Results showed that GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%).
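The macro-average accuracy used to rank models here weights each question category equally rather than each example. A minimal sketch of that metric, using invented class names rather than the benchmark's actual categories:

```python
from collections import defaultdict

def macro_average_accuracy(y_true, y_pred):
    # Average of per-class accuracies, so each class counts equally
    # regardless of how many examples it has.
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical 3-class example (class names are illustrative only)
y_true = ["glioma", "glioma", "glioma", "meningioma", "pituitary"]
y_pred = ["glioma", "glioma", "other",  "meningioma", "other"]
print(round(macro_average_accuracy(y_true, y_pred), 4))  # 0.5556
```

Here per-class accuracies are 2/3, 1/1, and 0/1, so the macro average (0.5556) is well below the plain example-level accuracy (3/5), which is exactly the effect macro averaging is meant to expose.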
arXiv Detail & Related papers (2025-08-14T17:35:31Z)
- Diagnostic Accuracy of Open-Source Vision-Language Models on Diverse Medical Imaging Tasks [1.6567957832859204]
This dataset includes 22,349 images from 7,461 patients encompassing chest radiography, colon pathology, endoscopy, neonatal jaundice assessment, and retinal fundoscopy. Qwen2.5 achieved the highest accuracy for chest radiographs (90.4%) and endoscopy images (84.2%), significantly outperforming the other models (p < .001). All models struggled with retinal fundoscopy; Qwen2.5 and Gemma3 achieved the highest, albeit modest, accuracies at 18.6% (comparable, p = .99), significantly better than the other tested models (p < .001).
arXiv Detail & Related papers (2025-08-01T18:28:37Z)
- A Clinician-Friendly Platform for Ophthalmic Image Analysis Without Technical Barriers [51.45596445363302]
GlobeReady is a clinician-friendly AI platform that enables fundus disease diagnosis without retraining, fine-tuning, or the need for technical expertise. We demonstrate high accuracy across imaging modalities: 93.9-98.5% for 11 fundus diseases using color fundus photographs (CFPs) and 87.2-92.7% for 15 fundus diseases using optical coherence tomography (OCT) scans. By leveraging training-free local feature augmentation, the GlobeReady platform effectively mitigates domain shifts across centers and populations.
arXiv Detail & Related papers (2025-04-22T14:17:22Z)
- A Multi-Modal AI System for Screening Mammography: Integrating 2D and 3D Imaging to Improve Breast Cancer Detection in a Prospective Clinical Study [2.9992821862882546]
False-positive recalls remain a concern in breast cancer screening. We developed a multi-modal artificial intelligence system integrating full-field digital mammography and synthetic mammography. Our AI system, trained on approximately 500,000 exams, achieved 0.945 AUROC on an internal test set.
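The 0.945 AUROC reported above can be read as a pairwise ranking probability: the chance a randomly chosen positive exam is scored above a randomly chosen negative one. A small self-contained sketch of that equivalence (toy scores below, not the study's model outputs):

```python
def auroc(y_true, scores):
    # AUROC via the Mann-Whitney U statistic: the fraction of
    # positive/negative pairs where the positive scores higher
    # (ties count half).
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: higher score = more suspicious exam
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
print(round(auroc(y_true, scores), 4))  # 0.8333
```

Because it is rank-based, AUROC is threshold-free, which is why screening studies often report it alongside the sensitivity/specificity of a specific operating point.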
arXiv Detail & Related papers (2025-04-08T03:29:40Z)
- Detection of subclinical atherosclerosis by image-based deep learning on chest x-ray [86.38767955626179]
A deep-learning algorithm to predict coronary artery calcium (CAC) score was developed on 460 chest x-rays.
The diagnostic accuracy of the AICAC model assessed by the area under the curve (AUC) was the primary outcome.
arXiv Detail & Related papers (2024-03-27T16:56:14Z)
- Multi-Head Feature Pyramid Networks for Breast Mass Detection [48.24995569980701]
We propose the multi-head feature pyramid module (MHFPN) to solve the problem of unbalanced focus of target boxes during feature map fusion.
Experimental studies show that, compared to SOTA detection baselines, our method improves by 6.58% (in AP@50) and 5.4% (in TPR@50) on the commonly used INbreast dataset.
arXiv Detail & Related papers (2023-02-22T03:02:52Z)
- EMT-NET: Efficient multitask network for computer-aided diagnosis of breast cancer [58.720142291102135]
We propose an efficient and lightweight learning architecture to classify and segment breast tumors simultaneously.
We incorporate a segmentation task into a tumor classification network, which makes the backbone network learn representations focused on tumor regions.
The accuracy, sensitivity, and specificity of tumor classification are 88.6%, 94.1%, and 85.3%, respectively.
arXiv Detail & Related papers (2022-01-13T05:24:40Z)
- Comparison of Machine Learning Classifiers to Predict Patient Survival and Genetics of GBM: Towards a Standardized Model for Clinical Implementation [44.02622933605018]
Radiomic models have been shown to outperform clinical data for outcome prediction in glioblastoma (GBM).
We aimed to compare nine machine learning classifiers to predict overall survival (OS), isocitrate dehydrogenase (IDH) mutation, O-6-methylguanine-DNA-methyltransferase (MGMT) promoter methylation, epidermal growth factor receptor (EGFR) VII amplification and Ki-67 expression in GBM patients.
xGB obtained maximum accuracy for OS (74.5%), AB for IDH mutation (88%), MGMT methylation (71.7%), Ki-67 expression (86.6%), and EGFR amplification (81,
arXiv Detail & Related papers (2021-02-10T15:10:37Z)
- Feature Fusion of Raman Chemical Imaging and Digital Histopathology using Machine Learning for Prostate Cancer Detection [0.0]
This study uses multimodal images formed from stained Digital Histopathology (DP) and unstained Raman Chemical Imaging (RCI).
The hypothesis tested was whether multimodal image models can outperform single modality baseline models in terms of diagnostic accuracy.
arXiv Detail & Related papers (2021-01-18T22:11:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.