Related papers: Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

URL: http://arxiv.org/abs/2405.20421v2
Date: Fri, 21 Jun 2024 09:32:19 GMT
Title: Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
Authors: Qianqi Yan, Xuehai He, Xiang Yue, Xin Eric Wang,
Abstract summary: Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA) This study reveals that when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions.
Score: 24.10436440624249
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA), achieving high accuracy on existing benchmarks. However, their reliability under robust evaluation is questionable. This study reveals that when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions. To address this critical evaluation problem, we introduce the Probing Evaluation for Medical Diagnosis (ProbMed) dataset to rigorously assess LMM performance in medical imaging through probing evaluation and procedural diagnosis. Particularly, probing evaluation features pairing original questions with negation questions with hallucinated attributes, while procedural diagnosis requires reasoning across various diagnostic dimensions for each image, including modality recognition, organ identification, clinical findings, abnormalities, and positional grounding. Our evaluation reveals that top-performing models like GPT-4o, GPT-4V, and Gemini Pro perform worse than random guessing on specialized diagnostic questions, indicating significant limitations in handling fine-grained medical inquiries. Besides, models like LLaVA-Med struggle even with more general questions, and results from CheXagent demonstrate the transferability of expertise across different modalities of the same organ, showing that specialized domain knowledge is still crucial for improving performance. This study underscores the urgent need for more robust evaluation to ensure the reliability of LMMs in critical fields like medical diagnosis, and current LMMs are still far from applicable to those fields.

Related papers

Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning.<n>We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z)
MedErr-CT: A Visual Question Answering Benchmark for Identifying and Correcting Errors in CT Reports [4.769418278782809]
We introduce MedErr-CT, a novel benchmark for evaluating medical MLLMs' ability to identify and correct errors in CT reports.<n>The benchmark includes six error categories - four vision-centric errors (Omission, Insertion, Direction, Size) and two lexical error types (Unit, Typo)
arXiv Detail & Related papers (2025-06-24T00:51:03Z)
MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports [49.00805568780791]
We introduce MedCaseReasoning, the first open-access dataset for evaluating Large Language Models (LLMs) on their ability to align with clinician-authored diagnostic reasoning.<n>The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements.<n>We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning.
arXiv Detail & Related papers (2025-05-16T22:34:36Z)
MedAgent-Pro: Towards Multi-modal Evidence-based Medical Diagnosis via Reasoning Agentic Workflow [16.089816031251335]
Multi-modal Large Language Models (MLLMs) have gained significant attention and achieved success across various domains. They lack detailed perception of visual inputs, limiting their ability to perform quantitative image analysis. We propose MedAgent-Pro, an evidence-based reasoning agentic system designed to achieve reliable, explainable, and precise medical diagnoses.
arXiv Detail & Related papers (2025-03-21T14:04:18Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
Uncertainty-aware abstention in medical diagnosis based on medical texts [87.88110503208016]
This study addresses the critical issue of reliability for AI-assisted medical diagnosis. We focus on the selection prediction approach that allows the diagnosis system to abstain from providing the decision if it is not confident in the diagnosis. We introduce HUQ-2, a new state-of-the-art method for enhancing reliability in selective prediction tasks.
arXiv Detail & Related papers (2025-02-25T10:15:21Z)
LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models [18.6994780408699]
Large Language Models (LLMs) face significant challenges in medical question answering. We propose a novel approach incorporating similar case generation within a multi-agent medical question-answering system. Our method capitalizes on the model's inherent medical knowledge and reasoning capabilities, eliminating the need for additional training data.
arXiv Detail & Related papers (2024-12-31T19:55:45Z)
Superhuman performance of a large language model on the reasoning tasks of a physician [10.043418251604624]
Performance of large language models (LLMs) on medical tasks has traditionally been evaluated using multiple choice question benchmarks. We evaluate OpenAI's o1-preview model, a model developed to increase run-time via chain of thought processes prior to generating a response.
arXiv Detail & Related papers (2024-12-14T14:46:18Z)
Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets. Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses. We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
arXiv Detail & Related papers (2024-10-16T23:03:27Z)
Assessing and Enhancing Large Language Models in Rare Disease Question-answering [64.32570472692187]
We introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of Large Language Models (LLMs) in diagnosing rare diseases. We collected 1360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases. We then benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models. Experiment results demonstrate that ReCOP can effectively improve the accuracy of LLMs on the ReDis-QA dataset by an average of 8%.
arXiv Detail & Related papers (2024-08-15T21:09:09Z)
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals. GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date. It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis [28.421857904824627]
MiniGPT-Med is a vision-language model derived from large-scale language models and tailored for medical applications. It is capable of performing tasks such as medical report generation, visual question answering (VQA), and disease identification within medical imagery. It achieves state-of-the-art performance on medical report generation, higher than the previous best model by 19% accuracy.
arXiv Detail & Related papers (2024-07-04T18:21:10Z)
RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning [14.366349078707263]
RJUA-MedDQA is a comprehensive benchmark in the field of medical specialization. This work introduces RJUA-MedDQA, a comprehensive benchmark in the field of medical specialization.
arXiv Detail & Related papers (2024-02-19T06:57:02Z)
Rescuing referral failures during automated diagnosis of domain-shifted medical images [17.349847762608086]
We show that even state-of-the-art domain generalization approaches fail severely during referral when tested on medical images acquired from a different demographic or using a different technology. We evaluate novel combinations of robust generalization and post hoc referral approaches, that rescue these failures and achieve significant performance improvements.
arXiv Detail & Related papers (2023-11-28T13:14:55Z)
BMAD: Benchmarks for Medical Anomaly Detection [51.22159321912891]
Anomaly detection (AD) is a fundamental research problem in machine learning and computer vision. In medical imaging, AD is especially vital for detecting and diagnosing anomalies that may indicate rare diseases or conditions. We introduce a comprehensive evaluation benchmark for assessing anomaly detection methods on medical images.
arXiv Detail & Related papers (2023-06-20T20:23:46Z)
Scalable Online Disease Diagnosis via Multi-Model-Fused Actor-Critic Reinforcement Learning [9.274138493400436]
For those seeking healthcare advice online, AI based dialogue agents capable of interacting with patients to perform automatic disease diagnosis are a viable option. This can be formulated as a problem of sequential feature (symptom) selection and classification for which reinforcement learning (RL) approaches have been proposed as a natural solution. We propose a Multi-Model-Fused Actor-Critic (MMF-AC) RL framework that consists of a generative actor network and a diagnostic critic network.
arXiv Detail & Related papers (2022-06-08T03:06:16Z)
Towards Causality-Aware Inferring: A Sequential Discriminative Approach for Medical Diagnosis [142.90770786804507]
Medical diagnosis assistant (MDA) aims to build an interactive diagnostic agent to sequentially inquire about symptoms for discriminating diseases. This work attempts to address these critical issues in MDA by taking advantage of the causal diagram. We propose a propensity-based patient simulator to effectively answer unrecorded inquiry by drawing knowledge from the other records.
arXiv Detail & Related papers (2020-03-14T02:05:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.