MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering
- URL: http://arxiv.org/abs/2406.06573v2
- Date: Sun, 1 Sep 2024 19:38:02 GMT
- Title: MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering
- Authors: Robert Osazuwa Ness, Katie Matton, Hayden Helm, Sheng Zhang, Junaid Bajwa, Carey E. Priebe, Eric Horvitz
- Abstract summary: Large language models (LLMs) have achieved impressive performance on medical question-answering benchmarks.
We show how performance on a "MedFuzzed" benchmark, as well as individual successful attacks, can provide insights into the ability of an LLM to operate robustly in more realistic settings.
- Score: 24.258546825446324
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have achieved impressive performance on medical question-answering benchmarks. However, high benchmark accuracy does not imply that this performance generalizes to real-world clinical settings. Medical question-answering benchmarks rely on assumptions that make LLM performance easy to quantify but that may not hold in the open world of the clinic. Yet LLMs learn broad knowledge that can help them generalize to practical conditions regardless of unrealistic assumptions in celebrated benchmarks. We seek to quantify how well LLM medical question-answering benchmark performance generalizes when benchmark assumptions are violated. Specifically, we present an adversarial method that we call MedFuzz (for medical fuzzing). MedFuzz attempts to modify benchmark questions in ways aimed at confounding the LLM. We demonstrate the approach by targeting strong assumptions about patient characteristics presented in the MedQA benchmark. Successful "attacks" modify a benchmark item in ways that would be unlikely to fool a medical expert but nonetheless "trick" the LLM into changing from a correct to an incorrect answer. Further, we present a permutation test technique that can ensure a successful attack is statistically significant. We show how performance on a "MedFuzzed" benchmark, as well as individual successful attacks, can provide insights into the ability of an LLM to operate robustly in more realistic settings.
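The two technical ingredients described above, an iterative attack loop and a permutation test for significance, can be illustrated with a minimal Python sketch. This is an illustration under assumptions rather than the paper's implementation: `query_attacker` and `query_target` are hypothetical placeholders for calls to the attacker and target LLMs, the attack prompt is invented, the item schema (`question`, `options`, `correct_answer`) is assumed, and the permutation test shown here is a generic one-sided test on the aggregate accuracy drop, whereas the paper's test is designed for individual attacks.

```python
import random

def medfuzz_attack(item, query_attacker, query_target, max_turns=5):
    """Iteratively rewrite a benchmark question until the target LLM's answer
    flips from correct to incorrect. `item` is assumed to be a dict with
    'question', 'options', and 'correct_answer' keys (hypothetical schema)."""
    question = item["question"]
    for _ in range(max_turns):
        # Attacker LLM adds plausible but clinically irrelevant patient details
        # that violate the benchmark's simplifying assumptions without changing
        # the answer a medical expert would give.
        question = query_attacker(
            "Rewrite this exam question by adding realistic patient details "
            "that should not alter the correct answer:\n" + question
        )
        answer = query_target(question, item["options"])
        if answer != item["correct_answer"]:
            return question  # successful attack: the target was fooled
    return None  # no flip within the turn budget

def permutation_pvalue(orig_correct, fuzzed_correct, n_perm=10_000, seed=0):
    """One-sided permutation test for the accuracy drop between original and
    fuzzed items; both arguments are lists of 0/1 correctness indicators."""
    rng = random.Random(seed)
    n, m = len(orig_correct), len(fuzzed_correct)
    observed = sum(orig_correct) / n - sum(fuzzed_correct) / m
    pooled = list(orig_correct) + list(fuzzed_correct)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis that fuzzing has no effect, correctness
        # labels are exchangeable between the two conditions.
        rng.shuffle(pooled)
        diff = sum(pooled[:n]) / n - sum(pooled[n:]) / m
        if diff >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0
```

For example, `permutation_pvalue([1]*90 + [0]*10, [1]*70 + [0]*30)` asks whether a 20-point accuracy drop across 100 fuzzed items could plausibly arise by chance under the null hypothesis that fuzzing has no effect.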
Related papers
- CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks [63.562924932512765]
Large Language Models (LLMs) have advanced the state-of-the-art in various coding tasks. LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models.
arXiv Detail & Related papers (2025-07-14T17:56:29Z)
- Improving Automatic Evaluation of Large Language Models (LLMs) in Biomedical Relation Extraction via LLMs-as-the-Judge [7.064104563689608]
Large Language Models (LLMs) have demonstrated impressive performance in biomedical relation extraction. This paper investigates the use of LLMs-as-the-Judge as an alternative evaluation method for biomedical relation extraction.
arXiv Detail & Related papers (2025-06-01T02:01:52Z)
- Medical Large Language Model Benchmarks Should Prioritize Construct Validity [9.453444826672474]
Research on medical large language models (LLMs) often makes bold claims, from encoding clinical knowledge to reasoning like a physician.
But how do we separate real progress from a leaderboard flex?
We argue that medical LLM benchmarks should (and indeed can) be empirically evaluated for their construct validity.
arXiv Detail & Related papers (2025-03-12T05:08:02Z)
- Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.
We propose a novel approach utilizing structured medical reasoning.
Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
- Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored.
Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.
We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z)
- A Benchmark for Long-Form Medical Question Answering [4.815957808858573]
There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA).
Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions.
In this work, we introduce a new publicly available benchmark featuring real-world consumer medical questions with long-form answer evaluations annotated by medical doctors.
arXiv Detail & Related papers (2024-11-14T22:54:38Z)
- LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering.
We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z)
- Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation [61.350306618479365]
Leakage of benchmarks can prevent the accurate assessment of large language models' true performance.
We propose Inference-Time Decontamination (ITD) to address this issue.
ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU.
arXiv Detail & Related papers (2024-06-20T04:35:59Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning [36.400896909161006]
We develop systems that proactively ask questions to gather more information and respond reliably.
We introduce a benchmark - MediQ - to evaluate question-asking ability in LLMs.
arXiv Detail & Related papers (2024-06-03T01:32:52Z)
- OpenMedLM: Prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models [4.556924372105915]
Open-source (OS) models represent a key area of growth for medical LLMs.
We present OpenMedLM, a prompting platform which delivers state-of-the-art (SOTA) performance for OS LLMs on medical benchmarks.
arXiv Detail & Related papers (2024-02-29T17:19:39Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering [45.84961106102445]
Large Language Models (LLMs) often perform poorly on domain-specific tasks such as medical question answering (QA).
We propose a comprehensive retrieval strategy to extract medical facts from an external knowledge base, and then inject them into the LLM's query prompt.
Our retrieval-augmented Vicuna-7B model exhibited an accuracy improvement from 44.46% to 48.54%.
arXiv Detail & Related papers (2023-09-27T21:26:03Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.