Gender Bias in Large Language Models for Healthcare: Assignment Consistency and Clinical Implications
- URL: http://arxiv.org/abs/2510.08614v1
- Date: Wed, 08 Oct 2025 01:11:06 GMT
- Title: Gender Bias in Large Language Models for Healthcare: Assignment Consistency and Clinical Implications
- Authors: Mingxuan Liu, Yuhe Ke, Wentao Zhu, Mayli Mertens, Yilin Ning, Jingchi Liao, Chuan Hong, Daniel Shu Wei Ting, Yifan Peng, Danielle S. Bitterman, Marcus Eng Hock Ong, Nan Liu
- Abstract summary: The integration of large language models into healthcare holds promise to enhance clinical decision-making. Gender has long influenced physician behaviors and patient outcomes. Some models even displayed a systematic female-male disparity in their interpretation of patient gender.
- Score: 16.066280458640676
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The integration of large language models (LLMs) into healthcare holds promise to enhance clinical decision-making, yet their susceptibility to biases remains a critical concern. Gender has long influenced physician behaviors and patient outcomes, raising concerns that LLMs assuming human-like roles, such as clinicians or medical educators, may replicate or amplify gender-related biases. Using case studies from the New England Journal of Medicine Challenge (NEJM), we assigned genders (female, male, or unspecified) to multiple open-source and proprietary LLMs. We evaluated their response consistency across LLM-gender assignments regarding both LLM-based diagnosis and models' judgments on the clinical relevance or necessity of patient gender. In our findings, diagnoses were relatively consistent across LLM genders for most models. However, for patient gender's relevance and necessity in LLM-based diagnosis, all models demonstrated substantial inconsistency across LLM genders, particularly for relevance judgments. Some models even displayed a systematic female-male disparity in their interpretation of patient gender. These findings present an underexplored bias that could undermine the reliability of LLMs in clinical practice, underscoring the need for routine checks of identity-assignment consistency when interacting with LLMs to ensure reliable and equitable AI-supported clinical care.
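To make the evaluation protocol concrete, below is a minimal sketch of the identity-assignment consistency check the abstract describes: assign a gender persona to the LLM, pose the same clinical case under each assignment, and measure agreement across assignments. The persona prompts, the `gpt-4o` model name, and the `chat` and `consistency` helpers are illustrative assumptions rather than the authors' exact setup; the sketch assumes an OpenAI-compatible chat API.

```python
# Sketch of an identity-assignment consistency check (illustrative, not the
# paper's exact protocol). Prompts, model name, and helpers are assumptions.
from collections import Counter

from openai import OpenAI  # any OpenAI-compatible client

client = OpenAI()

PERSONAS = {
    "female": "You are a female clinician.",
    "male": "You are a male clinician.",
    "unspecified": "You are a clinician.",
}

def chat(system_prompt: str, user_prompt: str, model: str = "gpt-4o") -> str:
    """One completion under an assigned persona; temperature 0 so that
    response differences reflect the persona, not sampling noise."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return resp.choices[0].message.content.strip()

def consistency(case_text: str, question: str) -> float:
    """Fraction of LLM-gender assignments agreeing with the modal answer.
    Exact string matching is crude; a real study would need semantic matching."""
    answers = [chat(p, f"Case: {case_text}\n{question}") for p in PERSONAS.values()]
    return Counter(answers).most_common(1)[0][1] / len(answers)

case = "..."  # an NEJM Challenge case vignette would go here

# The same check applies to both tasks the paper studies: the diagnosis itself,
# and the model's judgment of whether patient gender is relevant to it.
diag_consistency = consistency(case, "State the single most likely diagnosis.")
relevance_consistency = consistency(
    case, "Is the patient's gender clinically relevant here? Answer yes or no."
)
```

Running such a check per case and averaging over a case set would surface the kind of inconsistency the paper reports, where diagnoses stay stable across personas but relevance judgments shift.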
Related papers
- A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. Most medical LLMs are trained on data from a single institution, which limits their generalizability and safety in heterogeneous systems. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z) - The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making [13.734312822024947]
We introduce MedPerturb, a dataset designed to evaluate medical Large Language Models (LLMs) under controlled perturbations of clinical input. With MedPerturb, we release a dataset of 800 clinical contexts grounded in realistic input variability. We use MedPerturb in two case studies to reveal how shifts in gender identity cues, language style, or format lead to diverging treatment selections between humans and LLMs.
arXiv Detail & Related papers (2025-06-20T17:09:27Z) - From Promising Capability to Pervasive Bias: Assessing Large Language Models for Emergency Department Triage [6.135648377533492]
Large Language Models (LLMs) have shown promise in clinical decision support, yet their application to triage remains underexplored. We systematically investigate the capabilities of LLMs in emergency department triage through two key dimensions. We assess multiple LLM-based approaches, ranging from continued pre-training to in-context learning, as well as machine learning approaches.
arXiv Detail & Related papers (2025-04-22T21:11:47Z) - Bias in Large Language Models Across Clinical Applications: A Systematic Review [0.0]
Large language models (LLMs) are rapidly being integrated into healthcare, promising to enhance various clinical tasks. This systematic review investigates the prevalence, sources, manifestations, and clinical implications of bias in LLMs.
arXiv Detail & Related papers (2025-04-03T13:32:08Z) - Fact or Guesswork? Evaluating Large Language Models' Medical Knowledge with Structured One-Hop Judgments [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored. We introduce the Medical Knowledge Judgment (MKJ) dataset, derived from the Unified Medical Language System (UMLS), a comprehensive repository of standardized vocabularies and knowledge graphs. Through a binary classification framework, MKJ evaluates LLMs' grasp of fundamental medical facts by having them assess the validity of concise, one-hop statements.
arXiv Detail & Related papers (2025-02-20T05:27:51Z) - Gender Bias in LLM-generated Interview Responses [1.6124402884077915]
This study evaluates three LLMs to conduct a multifaceted audit of LLM-generated interview responses across models, question types, and jobs. Our findings reveal that gender bias is consistent and closely aligned with gender stereotypes and the gender dominance of jobs.
arXiv Detail & Related papers (2024-10-28T05:08:08Z) - How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making? [2.7476176772825904]
This research investigates the evaluation and mitigation of bias in Large Language Models (LLMs).
We introduce a novel Counterfactual Patient Variations (CPV) dataset derived from the JAMA Clinical Challenge.
Using this dataset, we build a framework for bias evaluation, employing both Multiple Choice Questions (MCQs) and corresponding explanations.
arXiv Detail & Related papers (2024-10-21T23:14:10Z) - GenderCARE: A Comprehensive Framework for Assessing and Reducing Gender Bias in Large Language Models [73.23743278545321]
Large language models (LLMs) have exhibited remarkable capabilities in natural language generation, but have also been observed to magnify societal biases. GenderCARE is a comprehensive framework that encompasses innovative Criteria, bias Assessment, Reduction techniques, and Evaluation metrics.
arXiv Detail & Related papers (2024-08-22T15:35:46Z) - CLIMB: A Benchmark of Clinical Bias in Large Language Models [39.82307008221118]
Large language models (LLMs) are increasingly applied to clinical decision-making.
Their potential to exhibit bias poses significant risks to clinical equity.
Currently, there is a lack of benchmarks that systematically evaluate such clinical bias in LLMs.
arXiv Detail & Related papers (2024-07-07T03:41:51Z) - Probing Explicit and Implicit Gender Bias through LLM Conditional Text Generation [64.79319733514266]
Large Language Models (LLMs) can generate biased and toxic responses.
We propose a conditional text generation mechanism without the need for predefined gender phrases and stereotypes.
arXiv Detail & Related papers (2023-11-01T05:31:46Z) - "Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters [97.11173801187816]
Large Language Models (LLMs) have recently emerged as an effective tool to assist individuals in writing various types of content.
This paper critically examines gender biases in LLM-generated reference letters.
arXiv Detail & Related papers (2023-10-13T16:12:57Z) - Auditing Algorithmic Fairness in Machine Learning for Health with Severity-Based LOGAN [70.76142503046782]
We propose supplementing bias audits of machine learning (ML) healthcare tools with SLOGAN, an automatic tool for capturing local biases in a clinical prediction task.
SLOGAN adapts an existing tool, LOcal Group biAs detectioN (LOGAN), by contextualizing group bias detection in patient illness severity and past medical history.
On average, SLOGAN identifies larger fairness disparities than LOGAN in over 75% of patient groups while maintaining clustering quality.
arXiv Detail & Related papers (2022-11-16T08:04:12Z) - Detecting Shortcut Learning for Fair Medical AI using Shortcut Testing [62.9062883851246]
Machine learning holds great promise for improving healthcare, but it is critical to ensure that its use will not propagate or amplify health disparities.
One potential driver of algorithmic unfairness, shortcut learning, arises when ML models base predictions on improper correlations in the training data.
Using multi-task learning, we propose the first method to assess and mitigate shortcut learning as a part of the fairness assessment of clinical ML systems.
arXiv Detail & Related papers (2022-07-21T09:35:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.