Addressing cognitive bias in medical language models
- URL: http://arxiv.org/abs/2402.08113v3
- Date: Tue, 20 Feb 2024 23:45:43 GMT
- Title: Addressing cognitive bias in medical language models
- Authors: Samuel Schmidgall, Carl Harris, Ime Essien, Daniel Olshvang, Tawsifur
Rahman, Ji Woong Kim, Rojin Ziaei, Jason Eshraghian, Peter Abadir, Rama
Chellappa
- Abstract summary: BiasMedQA is a benchmark for evaluating cognitive biases in large language models (LLMs) applied to medical tasks.
We tested six models on 1,273 questions from the US Medical Licensing Exam (USMLE) Steps 1, 2, and 3.
GPT-4 stood out for its resilience to bias, in contrast to Llama 2 70B-chat and PMC Llama 13B, which were disproportionately affected by cognitive bias.
- Score: 25.58126133789956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is increasing interest in the application of large language models (LLMs)
to the medical field, in part because of their impressive performance on
medical exam questions. While promising, exam questions do not reflect the
complexity of real patient-doctor interactions. In reality, physicians'
decisions are shaped by many complex factors, such as patient compliance,
personal experience, ethical beliefs, and cognitive bias. As a step toward
understanding this, we hypothesize that when LLMs are confronted with
clinical questions containing cognitive biases, they will yield significantly
less accurate responses compared to the same questions presented without such
biases. In this study, we developed BiasMedQA, a benchmark for evaluating
cognitive biases in LLMs applied to medical tasks. Using BiasMedQA we evaluated
six LLMs, namely GPT-4, Mixtral-8x70B, GPT-3.5, PaLM-2, Llama 2 70B-chat, and
the medically specialized PMC Llama 13B. We tested these models on 1,273
questions from the US Medical Licensing Exam (USMLE) Steps 1, 2, and 3,
modified to replicate common clinically-relevant cognitive biases. Our analysis
revealed varying effects of these biases on the LLMs, with GPT-4 standing out for
its resilience to bias, in contrast to Llama 2 70B-chat and PMC Llama 13B,
which were disproportionately affected by cognitive bias. Our findings
highlight the critical need for bias mitigation in the development of medical
LLMs, pointing towards safer and more reliable applications in healthcare.
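The protocol above suggests a simple evaluation harness: ask each USMLE question twice, once verbatim and once with a bias-inducing sentence injected, then compare accuracy across the two conditions. The Python sketch below is a minimal illustration under assumed names; the `query_model` callable, the question-dictionary format, and the confirmation-bias wording in `BIAS_TEMPLATE` are hypothetical stand-ins, not the authors' released BiasMedQA code.

```python
# Minimal sketch of a bias-injection evaluation in the spirit of BiasMedQA.
# Assumptions: `query_model` wraps an LLM API and returns an answer string
# beginning with an option letter; the data format and bias template below
# are illustrative only.

from typing import Callable

# Hypothetical bias prompt modeled on confirmation bias:
BIAS_TEMPLATE = (
    "{question}\n\n"
    "Note: the patient is convinced the diagnosis is {wrong_option}, "
    "and a colleague who saw them briefly agrees.\n"
)

def evaluate(questions: list[dict], query_model: Callable[[str], str]) -> dict:
    """Compare accuracy on unmodified vs. bias-injected questions.

    Each question dict is assumed to hold:
      'prompt'       - the USMLE-style stem plus lettered options
      'answer'       - the correct option letter, e.g. 'C'
      'wrong_option' - a plausible but incorrect option used as the bias cue
    """
    correct_plain = correct_biased = 0
    for q in questions:
        plain_prompt = q["prompt"]
        biased_prompt = BIAS_TEMPLATE.format(
            question=q["prompt"], wrong_option=q["wrong_option"]
        )
        if query_model(plain_prompt).strip().upper().startswith(q["answer"]):
            correct_plain += 1
        if query_model(biased_prompt).strip().upper().startswith(q["answer"]):
            correct_biased += 1
    n = len(questions)
    return {
        "accuracy_plain": correct_plain / n,
        "accuracy_biased": correct_biased / n,
        # A large drop here is the effect the benchmark is designed to expose.
        "accuracy_drop": (correct_plain - correct_biased) / n,
    }
```

Running the same harness per model and per bias template would yield the kind of per-model resilience comparison the abstract reports.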
Related papers
- Belief in the Machine: Investigating Epistemological Blind Spots of Language Models [51.63547465454027]
Language models (LMs) are essential for reliable decision-making in fields like healthcare, law, and journalism.
This study systematically evaluates the capabilities of modern LMs, including GPT-4, Claude-3, and Llama-3, using a new dataset, KaBLE.
Our results reveal key limitations. First, while LMs achieve 86% accuracy on factual scenarios, their performance drops significantly with false scenarios.
Second, LMs struggle with recognizing and affirming personal beliefs, especially when those beliefs contradict factual data.
arXiv Detail & Related papers (2024-10-28T16:38:20Z)
- Language Models And A Second Opinion Use Case: The Pocket Professional [0.0]
This research tests the role of Large Language Models (LLMs) as formal second opinion tools in professional decision-making.
The work analyzed 183 challenging medical cases from Medscape over a 20-month period, testing multiple LLMs' performance against crowd-sourced physician responses.
arXiv Detail & Related papers (2024-10-27T23:48:47Z)
- How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making? [2.7476176772825904]
This research investigates the evaluation and mitigation of bias in Large Language Models (LLMs).
We introduce a novel Counterfactual Patient Variations (CPV) dataset derived from the JAMA Clinical Challenge.
Using this dataset, we built a framework for bias evaluation, employing both Multiple Choice Questions (MCQs) and corresponding explanations.
arXiv Detail & Related papers (2024-10-21T23:14:10Z)
- CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed in several ways.
arXiv Detail & Related papers (2024-10-04T15:15:36Z)
- Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
We also introduce MedS-Ins, an instruction tuning dataset comprising 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
- Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As [1.0034156461900003]
Large language models (LLMs) show promising results in many aspects of language-based clinical practice.
We used a comprehensive medical knowledge graph (encompassing data from more than 50,000 peer-reviewed articles) and created the "EBMQA" dataset.
We benchmarked this dataset using more than 24,500 questions on two state-of-the-art LLMs: Chat-GPT4 and Claude3-Opus.
We found that both LLMs excelled more in semantic than numerical QAs, with Claude3 surpassing GPT4 in numerical QAs.
arXiv Detail & Related papers (2024-06-06T08:41:46Z)
- Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study [5.823006266363981]
Large language models (LLMs) have opened a promising avenue for patients to get their questions answered.
We generated responses to 53 questions from four LLMs: GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini.
We find that GPT-4's responses are more accurate, helpful, relevant, and safer.
arXiv Detail & Related papers (2024-01-23T22:03:51Z)
- Language models are susceptible to incorrect patient self-diagnosis in medical applications [0.0]
We prompt a variety of LLMs with multiple-choice questions from U.S. medical board exams, modified to include self-diagnostic reports from patients.
Our findings highlight that when a patient proposes incorrect bias-validating information, the diagnostic accuracy of LLMs drops dramatically.
arXiv Detail & Related papers (2023-09-17T19:56:39Z)
- Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024) [48.17095875619711]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT).
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We found that medical textbooks serve as a more effective retrieval corpus than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z)
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)
- Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)