Unmasking and Quantifying Racial Bias of Large Language Models in
Medical Report Generation
- URL: http://arxiv.org/abs/2401.13867v1
- Date: Thu, 25 Jan 2024 00:34:16 GMT
- Title: Unmasking and Quantifying Racial Bias of Large Language Models in
Medical Report Generation
- Authors: Yifan Yang, Xiaoyu Liu, Qiao Jin, Furong Huang, Zhiyong Lu
- Abstract summary: Large language models like GPT-3.5-turbo and GPT-4 hold promise for healthcare professionals.
These models tend to project higher costs and longer hospitalizations for White populations.
These biases mirror real-world healthcare disparities.
- Score: 36.15505795527914
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models like GPT-3.5-turbo and GPT-4 hold promise for
healthcare professionals, but they may inadvertently inherit biases during
their training, potentially affecting their utility in medical applications.
Despite few attempts in the past, the precise impact and extent of these biases
remain uncertain. Through both qualitative and quantitative analyses, we find
that these models tend to project higher costs and longer hospitalizations for
White populations and exhibit optimistic views in challenging medical scenarios
with much higher survival rates. These biases, which mirror real-world
healthcare disparities, are evident in the generation of patient backgrounds,
the association of specific diseases with certain races, and disparities in
treatment recommendations, etc. Our findings underscore the critical need for
future research to address and mitigate biases in language models, especially
in critical healthcare applications, to ensure fair and accurate outcomes for
all patients.
Related papers
- Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams [32.77551245372691]
Existing benchmarks for evaluating Large Language Models (LLMs) in healthcare predominantly focus on medical doctors.
We introduce the Examinations for Medical Personnel in Chinese (EMPEC), a pioneering large-scale healthcare knowledge benchmark in traditional Chinese.
EMPEC consists of 157,803 exam questions across 124 subjects and 20 healthcare professions, including underrepresented occupations like Optometrists and Audiologists.
arXiv Detail & Related papers (2024-06-17T08:40:36Z) - Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias [3.455189439319919]
We introduce Cross-Care, the first benchmark framework dedicated to assessing biases and real world knowledge in large language models (LLMs)
We evaluate how demographic biases embedded in pre-training corpora like $ThePile$ influence the outputs of LLMs.
Our results highlight substantial misalignment between LLM representation of disease prevalence and real disease prevalence rates across demographic subgroups.
arXiv Detail & Related papers (2024-05-09T02:33:14Z) - Seeds of Stereotypes: A Large-Scale Textual Analysis of Race and Gender Associations with Diseases in Online Sources [1.8259644946867188]
The study analyzed the context in which various diseases are discussed alongside markers of race and gender.
We found that demographic terms are disproportionately associated with specific disease concepts in online texts.
We find widespread disparities in the associations of specific racial and gender terms with the 18 diseases analyzed.
arXiv Detail & Related papers (2024-05-08T13:38:56Z) - Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z) - What's in a Name? Auditing Large Language Models for Race and Gender
Bias [49.28899492966893]
We employ an audit design to investigate biases in state-of-the-art large language models, including GPT-4.
We find that the advice systematically disadvantages names that are commonly associated with racial minorities and women.
arXiv Detail & Related papers (2024-02-21T18:25:25Z) - MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data
Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z) - Evaluating the Fairness of Deep Learning Uncertainty Estimates in
Medical Image Analysis [3.5536769591744557]
Deep learning (DL) models have shown great success in many medical image analysis tasks.
However, deployment of the resulting models into real clinical contexts requires robustness and fairness across different sub-populations.
Recent studies have shown significant biases in DL models across demographic subgroups, indicating a lack of fairness in the models.
arXiv Detail & Related papers (2023-03-06T16:01:30Z) - What Do You See in this Patient? Behavioral Testing of Clinical NLP
Models [69.09570726777817]
We introduce an extendable testing framework that evaluates the behavior of clinical outcome models regarding changes of the input.
We show that model behavior varies drastically even when fine-tuned on the same data and that allegedly best-performing models have not always learned the most medically plausible patterns.
arXiv Detail & Related papers (2021-11-30T15:52:04Z) - MIMIC-IF: Interpretability and Fairness Evaluation of Deep Learning
Models on MIMIC-IV Dataset [15.436560770086205]
We focus on MIMIC-IV (Medical Information Mart for Intensive Care, version IV), the largest publicly available healthcare dataset.
We conduct comprehensive analyses of dataset representation bias as well as interpretability and prediction fairness of deep learning models for in-hospital mortality prediction.
arXiv Detail & Related papers (2021-02-12T20:28:06Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced
Data [81.00385374948125]
We present UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD)
UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to $19%$ over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.