Related papers: Evaluation of Bias Towards Medical Professionals in Large Language Models

Evaluation of Bias Towards Medical Professionals in Large Language Models

URL: http://arxiv.org/abs/2407.12031v1
Date: Sun, 30 Jun 2024 05:55:55 GMT
Title: Evaluation of Bias Towards Medical Professionals in Large Language Models
Authors: Xi Chen, Yang Xu, MingKe You, Li Wang, WeiZhi Liu, Jian Li,
Abstract summary: GPT-4, Claude-3, and Mistral-Large showed significant gender and racial biases when evaluating medical professionals for residency selection. Tests revealed strong preferences towards Hispanic females and Asian males in various specialties.
Score: 11.450991679521605
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This study evaluates whether large language models (LLMs) exhibit biases towards medical professionals. Fictitious candidate resumes were created to control for identity factors while maintaining consistent qualifications. Three LLMs (GPT-4, Claude-3-haiku, and Mistral-Large) were tested using a standardized prompt to evaluate resumes for specific residency programs. Explicit bias was tested by changing gender and race information, while implicit bias was tested by changing names while hiding race and gender. Physician data from the Association of American Medical Colleges was used to compare with real-world demographics. 900,000 resumes were evaluated. All LLMs exhibited significant gender and racial biases across medical specialties. Gender preferences varied, favoring male candidates in surgery and orthopedics, while preferring females in dermatology, family medicine, obstetrics and gynecology, pediatrics, and psychiatry. Claude-3 and Mistral-Large generally favored Asian candidates, while GPT-4 preferred Black and Hispanic candidates in several specialties. Tests revealed strong preferences towards Hispanic females and Asian males in various specialties. Compared to real-world data, LLMs consistently chose higher proportions of female and underrepresented racial candidates than their actual representation in the medical workforce. GPT-4, Claude-3, and Mistral-Large showed significant gender and racial biases when evaluating medical professionals for residency selection. These findings highlight the potential for LLMs to perpetuate biases and compromise healthcare workforce diversity if used without proper bias mitigation strategies.

Related papers

The Root Shapes the Fruit: On the Persistence of Gender-Exclusive Harms in Aligned Language Models [58.130894823145205]
We center transgender, nonbinary, and other gender-diverse identities to investigate how alignment procedures interact with pre-existing gender-diverse bias. Our findings reveal that DPO-aligned models are particularly sensitive to supervised finetuning. We conclude with recommendations tailored to DPO and broader alignment practices.
arXiv Detail & Related papers (2024-11-06T06:50:50Z)
Evaluating Gender, Racial, and Age Biases in Large Language Models: A Comparative Analysis of Occupational and Crime Scenarios [0.0]
This paper examines bias in Large Language Models (LLMs) Findings reveal that LLMs often depict female characters more frequently than male ones in various occupations. Efforts to reduce gender and racial bias often lead to outcomes that may over-index one sub-class.
arXiv Detail & Related papers (2024-09-22T20:21:20Z)
From Bias to Balance: Detecting Facial Expression Recognition Biases in Large Multimodal Foundation Models [3.1927733045184885]
This study addresses the racial biases in facial expression recognition (FER) systems within Large Multimodal Foundation Models (LMFMs) Existing research predominantly focuses on traditional FER models (CNNs, RNNs, ViTs) leaving a gap in understanding racial biases in LMFMs. We benchmark four leading LMFMs: GPT-4o, PaliGemma, Gemini, and CLIP to assess their performance in facial emotion detection across different racial demographics.
arXiv Detail & Related papers (2024-08-27T07:54:01Z)
GenderCARE: A Comprehensive Framework for Assessing and Reducing Gender Bias in Large Language Models [73.23743278545321]
Large language models (LLMs) have exhibited remarkable capabilities in natural language generation, but have also been observed to magnify societal biases. GenderCARE is a comprehensive framework that encompasses innovative Criteria, bias Assessment, Reduction techniques, and Evaluation metrics.
arXiv Detail & Related papers (2024-08-22T15:35:46Z)
GenderBias-\emph{VL}: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing [72.0343083866144]
This paper introduces the GenderBias-emphVL benchmark to evaluate occupation-related gender bias in Large Vision-Language Models. Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs and state-of-the-art commercial APIs. Our findings reveal widespread gender biases in existing LVLMs.
arXiv Detail & Related papers (2024-06-30T05:55:15Z)
"You Gotta be a Doctor, Lin": An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations [29.183942575629214]
We utilize GPT-3.5-Turbo and Llama 3-70B-Instruct to simulate hiring decisions and salary recommendations for candidates with 320 first names that strongly signal their race and gender. Our empirical results indicate a preference among these models for hiring candidates with White female-sounding names over other demographic groups across 40 occupations.
arXiv Detail & Related papers (2024-06-18T03:11:43Z)
Do Large Language Models Discriminate in Hiring Decisions on the Basis of Race, Ethnicity, and Gender? [15.362940175441048]
We examine whether large language models (LLMs) exhibit race- and gender-based name discrimination in hiring decisions. We design a series of templatic prompts to LLMs to write an email to a named job applicant informing them of a hiring decision. By manipulating the applicant's first name, we measure the effect of perceived race, ethnicity, and gender on the probability that the LLM generates an acceptance or rejection email.
arXiv Detail & Related papers (2024-06-15T03:31:16Z)
White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMs [58.27353205269664]
Social biases can manifest in language agency. We introduce the novel Language Agency Bias Evaluation benchmark. We unveil language agency social biases in 3 recent Large Language Model (LLM)-generated content.
arXiv Detail & Related papers (2024-04-16T12:27:54Z)
What's in a Name? Auditing Large Language Models for Race and Gender Bias [49.28899492966893]
We employ an audit design to investigate biases in state-of-the-art large language models, including GPT-4. We find that the advice systematically disadvantages names that are commonly associated with racial minorities and women.
arXiv Detail & Related papers (2024-02-21T18:25:25Z)
Probing Explicit and Implicit Gender Bias through LLM Conditional Text Generation [64.79319733514266]
Large Language Models (LLMs) can generate biased and toxic responses. We propose a conditional text generation mechanism without the need for predefined gender phrases and stereotypes.
arXiv Detail & Related papers (2023-11-01T05:31:46Z)
Assessing Group-level Gender Bias in Professional Evaluations: The Case of Medical Student End-of-Shift Feedback [14.065979111248497]
Female physicians tend to be underrepresented in senior positions, make less money than their male counterparts and receive fewer promotions. This work was mainly conducted by looking for specific words using fixed dictionaries such as LIWC and focused on recommendation letters. We use a dataset of written and quantitative assessments of medical student performance on individual shifts of work, collected across multiple institutions, to investigate the extent to which gender bias exists in a day-to-day context for medical students.
arXiv Detail & Related papers (2022-06-01T05:01:36Z)
How True is GPT-2? An Empirical Analysis of Intersectional Occupational Biases [50.591267188664666]
Downstream applications are at risk of inheriting biases contained in natural language models. We analyze the occupational biases of a popular generative language model, GPT-2. For a given job, GPT-2 reflects the societal skew of gender and ethnicity in the US, and in some cases, pulls the distribution towards gender parity.
arXiv Detail & Related papers (2021-02-08T11:10:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.