Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
- URL: http://arxiv.org/abs/2405.20421v4
- Date: Sat, 05 Oct 2024 00:09:21 GMT
- Title: Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
- Authors: Qianqi Yan, Xuehai He, Xiang Yue, Xin Eric Wang,
- Abstract summary: Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA)
This study reveals that when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions.
- Score: 24.10436440624249
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA), achieving high accuracy on existing benchmarks. However, their reliability under robust evaluation is questionable. This study reveals that when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions. To address this critical evaluation problem, we introduce the Probing Evaluation for Medical Diagnosis (ProbMed) dataset to rigorously assess LMM performance in medical imaging through probing evaluation and procedural diagnosis. Particularly, probing evaluation features pairing original questions with negation questions with hallucinated attributes, while procedural diagnosis requires reasoning across various diagnostic dimensions for each image, including modality recognition, organ identification, clinical findings, abnormalities, and positional grounding. Our evaluation reveals that top-performing models like GPT-4o, GPT-4V, and Gemini Pro perform worse than random guessing on specialized diagnostic questions, indicating significant limitations in handling fine-grained medical inquiries. Besides, models like LLaVA-Med struggle even with more general questions, and results from CheXagent demonstrate the transferability of expertise across different modalities of the same organ, showing that specialized domain knowledge is still crucial for improving performance. This study underscores the urgent need for more robust evaluation to ensure the reliability of LMMs in critical fields like medical diagnosis, and current LMMs are still far from applicable to those fields.
Related papers
- MedAgent-Pro: Towards Multi-modal Evidence-based Medical Diagnosis via Reasoning Agentic Workflow [16.089816031251335]
Multi-modal Large Language Models (MLLMs) have gained significant attention and achieved success across various domains.
They lack detailed perception of visual inputs, limiting their ability to perform quantitative image analysis.
We propose MedAgent-Pro, an evidence-based reasoning agentic system designed to achieve reliable, explainable, and precise medical diagnoses.
arXiv Detail & Related papers (2025-03-21T14:04:18Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.
We propose a novel approach utilizing structured medical reasoning.
Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - Uncertainty-aware abstention in medical diagnosis based on medical texts [87.88110503208016]
This study addresses the critical issue of reliability for AI-assisted medical diagnosis.
We focus on the selection prediction approach that allows the diagnosis system to abstain from providing the decision if it is not confident in the diagnosis.
We introduce HUQ-2, a new state-of-the-art method for enhancing reliability in selective prediction tasks.
arXiv Detail & Related papers (2025-02-25T10:15:21Z) - LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models [18.6994780408699]
Large Language Models (LLMs) face significant challenges in medical question answering.
We propose a novel approach incorporating similar case generation within a multi-agent medical question-answering system.
Our method capitalizes on the model's inherent medical knowledge and reasoning capabilities, eliminating the need for additional training data.
arXiv Detail & Related papers (2024-12-31T19:55:45Z) - Superhuman performance of a large language model on the reasoning tasks of a physician [10.043418251604624]
Performance of large language models (LLMs) on medical tasks has traditionally been evaluated using multiple choice question benchmarks.
We evaluate OpenAI's o1-preview model, a model developed to increase run-time via chain of thought processes prior to generating a response.
arXiv Detail & Related papers (2024-12-14T14:46:18Z) - Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs)
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z) - MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools.
Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses.
We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
arXiv Detail & Related papers (2024-10-16T23:03:27Z) - Assessing and Enhancing Large Language Models in Rare Disease Question-answering [64.32570472692187]
We introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of Large Language Models (LLMs) in diagnosing rare diseases.
We collected 1360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases.
We then benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models.
Experiment results demonstrate that ReCOP can effectively improve the accuracy of LLMs on the ReDis-QA dataset by an average of 8%.
arXiv Detail & Related papers (2024-08-15T21:09:09Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis [28.421857904824627]
MiniGPT-Med is a vision-language model derived from large-scale language models and tailored for medical applications.
It is capable of performing tasks such as medical report generation, visual question answering (VQA), and disease identification within medical imagery.
It achieves state-of-the-art performance on medical report generation, higher than the previous best model by 19% accuracy.
arXiv Detail & Related papers (2024-07-04T18:21:10Z) - RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question
Answering and Clinical Reasoning [14.366349078707263]
RJUA-MedDQA is a comprehensive benchmark in the field of medical specialization.
This work introduces RJUA-MedDQA, a comprehensive benchmark in the field of medical specialization.
arXiv Detail & Related papers (2024-02-19T06:57:02Z) - Rescuing referral failures during automated diagnosis of domain-shifted
medical images [17.349847762608086]
We show that even state-of-the-art domain generalization approaches fail severely during referral when tested on medical images acquired from a different demographic or using a different technology.
We evaluate novel combinations of robust generalization and post hoc referral approaches, that rescue these failures and achieve significant performance improvements.
arXiv Detail & Related papers (2023-11-28T13:14:55Z) - BMAD: Benchmarks for Medical Anomaly Detection [51.22159321912891]
Anomaly detection (AD) is a fundamental research problem in machine learning and computer vision.
In medical imaging, AD is especially vital for detecting and diagnosing anomalies that may indicate rare diseases or conditions.
We introduce a comprehensive evaluation benchmark for assessing anomaly detection methods on medical images.
arXiv Detail & Related papers (2023-06-20T20:23:46Z) - Scalable Online Disease Diagnosis via Multi-Model-Fused Actor-Critic
Reinforcement Learning [9.274138493400436]
For those seeking healthcare advice online, AI based dialogue agents capable of interacting with patients to perform automatic disease diagnosis are a viable option.
This can be formulated as a problem of sequential feature (symptom) selection and classification for which reinforcement learning (RL) approaches have been proposed as a natural solution.
We propose a Multi-Model-Fused Actor-Critic (MMF-AC) RL framework that consists of a generative actor network and a diagnostic critic network.
arXiv Detail & Related papers (2022-06-08T03:06:16Z) - Towards Causality-Aware Inferring: A Sequential Discriminative Approach
for Medical Diagnosis [142.90770786804507]
Medical diagnosis assistant (MDA) aims to build an interactive diagnostic agent to sequentially inquire about symptoms for discriminating diseases.
This work attempts to address these critical issues in MDA by taking advantage of the causal diagram.
We propose a propensity-based patient simulator to effectively answer unrecorded inquiry by drawing knowledge from the other records.
arXiv Detail & Related papers (2020-03-14T02:05:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.