Addressing cognitive bias in medical language models
- URL: http://arxiv.org/abs/2402.08113v3
- Date: Tue, 20 Feb 2024 23:45:43 GMT
- Title: Addressing cognitive bias in medical language models
- Authors: Samuel Schmidgall, Carl Harris, Ime Essien, Daniel Olshvang, Tawsifur
Rahman, Ji Woong Kim, Rojin Ziaei, Jason Eshraghian, Peter Abadir, Rama
Chellappa
- Abstract summary: BiasMedQA is a benchmark for evaluating cognitive biases in large language models (LLMs) applied to medical tasks.
We tested six models on 1,273 questions from the US Medical Licensing Exam (USMLE) Steps 1, 2, and 3.
GPT-4 stood out for its resilience to bias, in contrast to Llama 2 70B-chat and PMC Llama 13B, which were disproportionately affected by cognitive bias.
- Score: 25.58126133789956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is increasing interest in the application of large language models (LLMs)
to the medical field, in part because of their impressive performance on
medical exam questions. While promising, exam questions do not reflect the
complexity of real patient-doctor interactions. In reality, physicians'
decisions are shaped by many complex factors, such as patient compliance,
personal experience, ethical beliefs, and cognitive bias. As a step toward
understanding this, we hypothesize that when LLMs are confronted with
clinical questions containing cognitive biases, they will yield significantly
less accurate responses compared to the same questions presented without such
biases. In this study, we developed BiasMedQA, a benchmark for evaluating
cognitive biases in LLMs applied to medical tasks. Using BiasMedQA we evaluated
six LLMs, namely GPT-4, Mixtral-8x70B, GPT-3.5, PaLM-2, Llama 2 70B-chat, and
the medically specialized PMC Llama 13B. We tested these models on 1,273
questions from the US Medical Licensing Exam (USMLE) Steps 1, 2, and 3,
modified to replicate common clinically-relevant cognitive biases. Our analysis
revealed varying effects of these biases on the LLMs, with GPT-4 standing out for
its resilience to bias, in contrast to Llama 2 70B-chat and PMC Llama 13B,
which were disproportionately affected by cognitive bias. Our findings
highlight the critical need for bias mitigation in the development of medical
LLMs, pointing towards safer and more reliable applications in healthcare.
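The protocol above suggests a simple evaluation harness: ask each USMLE question twice, once verbatim and once with a bias-inducing sentence injected, then compare accuracy across the two conditions. The Python sketch below is a minimal illustration under assumed names; the `query_model` callable, the question-dictionary format, and the confirmation-bias wording in `BIAS_TEMPLATE` are hypothetical stand-ins, not the authors' released BiasMedQA code.

```python
# Minimal sketch of a bias-injection evaluation in the spirit of BiasMedQA.
# Assumptions: `query_model` wraps an LLM API and returns an answer string
# beginning with an option letter; the data format and bias template below
# are illustrative only.

from typing import Callable

# Hypothetical bias prompt modeled on confirmation bias:
BIAS_TEMPLATE = (
    "{question}\n\n"
    "Note: the patient is convinced the diagnosis is {wrong_option}, "
    "and a colleague who saw them briefly agrees.\n"
)

def evaluate(questions: list[dict], query_model: Callable[[str], str]) -> dict:
    """Compare accuracy on unmodified vs. bias-injected questions.

    Each question dict is assumed to hold:
      'prompt'       - the USMLE-style stem plus lettered options
      'answer'       - the correct option letter, e.g. 'C'
      'wrong_option' - a plausible but incorrect option used as the bias cue
    """
    correct_plain = correct_biased = 0
    for q in questions:
        plain_prompt = q["prompt"]
        biased_prompt = BIAS_TEMPLATE.format(
            question=q["prompt"], wrong_option=q["wrong_option"]
        )
        if query_model(plain_prompt).strip().upper().startswith(q["answer"]):
            correct_plain += 1
        if query_model(biased_prompt).strip().upper().startswith(q["answer"]):
            correct_biased += 1
    n = len(questions)
    return {
        "accuracy_plain": correct_plain / n,
        "accuracy_biased": correct_biased / n,
        # A large drop here is the effect the benchmark is designed to expose.
        "accuracy_drop": (correct_plain - correct_biased) / n,
    }
```

Running the same harness per model and per bias template would yield the kind of per-model resilience comparison the abstract reports.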
Related papers
- Belief in the Machine: Investigating Epistemological Blind Spots of Language Models [51.63547465454027]
Language models (LMs) are essential for reliable decision-making in fields like healthcare, law, and journalism.
This study systematically evaluates the capabilities of modern LMs, including GPT-4, Claude-3, and Llama-3, using a new dataset, KaBLE.
Our results reveal key limitations. First, while LMs achieve 86% accuracy on factual scenarios, their performance drops significantly with false scenarios.
Second, LMs struggle with recognizing and affirming personal beliefs, especially when those beliefs contradict factual data.
arXiv Detail & Related papers (2024-10-28T16:38:20Z)
- Language Models And A Second Opinion Use Case: The Pocket Professional [0.0]
This research tests the role of Large Language Models (LLMs) as formal second opinion tools in professional decision-making.
The work analyzed 183 challenging medical cases from Medscape over a 20-month period, testing multiple LLMs' performance against crowd-sourced physician responses.
arXiv Detail & Related papers (2024-10-27T23:48:47Z)
- How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making? [2.7476176772825904]
This research investigates the evaluation and mitigation of bias in Large Language Models (LLMs).
We introduce a novel Counterfactual Patient Variations (CPV) dataset derived from the JAMA Clinical Challenge.
Using this dataset, we built a framework for bias evaluation, employing both Multiple Choice Questions (MCQs) and corresponding explanations.
arXiv Detail & Related papers (2024-10-21T23:14:10Z)
- CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed in several ways.
arXiv Detail & Related papers (2024-10-04T15:15:36Z)
- Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
We also introduce MedS-Ins, an instruction tuning dataset comprising 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
- Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As [1.0034156461900003]
Large language models (LLMs) show promising results in many aspects of language-based clinical practice.
We used a comprehensive medical knowledge graph (encompassing data from more than 50,000 peer-reviewed articles) and created the "EBMQA" dataset.
We benchmarked this dataset using more than 24,500 questions on two state-of-the-art LLMs: Chat-GPT4 and Claude3-Opus.
We found that both LLMs excelled more in semantic than numerical QAs, with Claude3 surpassing GPT4 in numerical QAs.
arXiv Detail & Related papers (2024-06-06T08:41:46Z)
- Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study [5.823006266363981]
Large language models (LLMs) have opened a promising avenue for patients to get their questions answered.
We generated responses to 53 questions from four LLMs: GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini.
We find that GPT-4's responses are more accurate, helpful, relevant, and safer.
arXiv Detail & Related papers (2024-01-23T22:03:51Z)
- Language models are susceptible to incorrect patient self-diagnosis in medical applications [0.0]
We prompt a variety of LLMs with multiple-choice questions from U.S. medical board exams, modified to include self-diagnostic reports from patients.
Our findings highlight that when a patient proposes incorrect bias-validating information, the diagnostic accuracy of LLMs drops dramatically.
arXiv Detail & Related papers (2023-09-17T19:56:39Z)
- Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024) [48.17095875619711]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT).
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We found that medical textbooks serve as a more effective retrieval corpus than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z)
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)
- Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)