Related papers: Enabling Scalable Evaluation of Bias Patterns in Medical LLMs

Enabling Scalable Evaluation of Bias Patterns in Medical LLMs

URL: http://arxiv.org/abs/2410.14763v1
Date: Fri, 18 Oct 2024 14:17:03 GMT
Title: Enabling Scalable Evaluation of Bias Patterns in Medical LLMs
Authors: Hamed Fayyaz, Raphael Poulain, Rahmatollah Beheshti,
Abstract summary: Large language models (LLMs) have shown impressive potential in helping with numerous medical challenges. One major area of concern relates to biased behaviors of LLMs in medical applications, leading to unfair treatment of individuals. We present a new method to scale up such bias evaluations by automatically generating test cases based on rigorous medical evidence.
Score: 2.089191490381739
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have shown impressive potential in helping with numerous medical challenges. Deploying LLMs in high-stakes applications such as medicine, however, brings in many concerns. One major area of concern relates to biased behaviors of LLMs in medical applications, leading to unfair treatment of individuals. To pave the way for the responsible and impactful deployment of Med LLMs, rigorous evaluation is a key prerequisite. Due to the huge complexity and variability of different medical scenarios, existing work in this domain has primarily relied on using manually crafted datasets for bias evaluation. In this study, we present a new method to scale up such bias evaluations by automatically generating test cases based on rigorous medical evidence. We specifically target the challenges of a) domain-specificity of bias characterization, b) hallucinating while generating the test cases, and c) various dependencies between the health outcomes and sensitive attributes. To that end, we offer new methods to address these challenges integrated with our generative pipeline, using medical knowledge graphs, medical ontologies, and customized general LLM evaluation frameworks in our method. Through a series of extensive experiments, we show that the test cases generated by our proposed method can effectively reveal bias patterns in Med LLMs at larger and more flexible scales than human-crafted datasets. We publish a large bias evaluation dataset using our pipeline, which is dedicated to a few medical case studies. A live demo of our application for vignette generation is available at https://vignette.streamlit.app. Our code is also available at https://github.com/healthylaife/autofair.

Related papers

Tell Me You're Biased Without Telling Me You're Biased -- Toward Revealing Implicit Biases in Medical LLMs [1.7166356507622822]
Large language models (LLMs) that are used in medical applications are known to show biased and unfair patterns.<n>It is crucial to identify these bias patterns to enable effective mitigation of their impact.<n>We present a novel framework combining knowledge graphs (KGs) with auxiliary LLMs to systematically reveal complex bias patterns.
arXiv Detail & Related papers (2025-07-26T02:33:48Z)
Bias in Large Language Models Across Clinical Applications: A Systematic Review [0.0]
Large language models (LLMs) are rapidly being integrated into healthcare, promising to enhance various clinical tasks. This systematic review investigates the prevalence, sources, manifestations, and clinical implications of bias in LLMs.
arXiv Detail & Related papers (2025-04-03T13:32:08Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets. Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making? [2.7476176772825904]
This research investigates the evaluation and mitigation of bias in Large Language Models (LLMs) We introduce a novel Counterfactual Patient Variations (CPV) dataset derived from the JAMA Clinical Challenge. Using this dataset, we built a framework for bias evaluation, employing both Multiple Choice Questions (MCQs) and corresponding explanations.
arXiv Detail & Related papers (2024-10-21T23:14:10Z)
Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts. MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation. MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
Aligning (Medical) LLMs for (Counterfactual) Fairness [2.089191490381739]
Large Language Models (LLMs) have emerged as promising solutions for medical and clinical decision support applications. LLMs are subject to different types of biases, which can lead to unfair treatment of individuals, worsening health disparities, and reducing trust in AI-augmented medical tools. We present a new model alignment approach for aligning LLMs using a preference optimization method within a knowledge distillation framework.
arXiv Detail & Related papers (2024-08-22T01:11:27Z)
A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper introduces a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods. The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics. We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z)
Large Language Model Distilling Medication Recommendation Model [61.89754499292561]
We harness the powerful semantic comprehension and input-agnostic characteristics of Large Language Models (LLMs) Our research aims to transform existing medication recommendation methodologies using LLMs. To mitigate this, we have developed a feature-level knowledge distillation technique, which transfers the LLM's proficiency to a more compact model.
arXiv Detail & Related papers (2024-02-05T08:25:22Z)
Distilling Large Language Models for Matching Patients to Clinical Trials [3.4068841624198942]
The recent success of large language models (LLMs) has paved the way for their adoption in the high-stakes domain of healthcare. This study presents the first systematic examination of the efficacy of both proprietary (GPT-3.5, and GPT-4) and open-source LLMs (LLAMA 7B,13B, and 70B) for the task of patient-trial matching. Our findings reveal that open-source LLMs, when fine-tuned on this limited and synthetic dataset, demonstrate performance parity with their proprietary counterparts.
arXiv Detail & Related papers (2023-12-15T17:11:07Z)
Extrinsically-Focused Evaluation of Omissions in Medical Summarization [9.847304366680772]
Large language models (LLMs) have shown promise in safety-critical applications such as healthcare, yet the ability to quantify performance has lagged. We propose MED-OMIT as a metric to explore the challenge of evaluating a summary of a patient's medical record.
arXiv Detail & Related papers (2023-11-14T16:46:15Z)
Med-Flamingo: a Multimodal Medical Few-shot Learner [58.85676013818811]
We propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain. Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks. We conduct the first human evaluation for generative medical VQA where physicians review the problems and blinded generations in an interactive app.
arXiv Detail & Related papers (2023-07-27T20:36:02Z)
Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study [60.56194508762205]
We reproduce, compare, and analyze state-of-the-art automated medical coding machine learning models. We show that several models underperform due to weak configurations, poorly sampled train-test splits, and insufficient evaluation. We present the first comprehensive results on the newly released MIMIC-IV dataset using the reproduced models.
arXiv Detail & Related papers (2023-04-21T11:54:44Z)
Large Language Models Encode Clinical Knowledge [21.630872464930587]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
arXiv Detail & Related papers (2022-12-26T14:28:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.