Related papers: Stereotype Detection in LLMs: A Multiclass, Explainable, and Benchmark-Driven Approach

Stereotype Detection in LLMs: A Multiclass, Explainable, and Benchmark-Driven Approach

URL: http://arxiv.org/abs/2404.01768v2
Date: Sat, 16 Nov 2024 00:54:09 GMT
Title: Stereotype Detection in LLMs: A Multiclass, Explainable, and Benchmark-Driven Approach
Authors: Zekun Wu, Sahan Bulathwela, Maria Perez-Ortiz, Adriano Soares Koshiyama,
Abstract summary: This paper introduces the Multi-Grain Stereotype (MGS) dataset, consisting of 51,867 instances across gender, race, profession, religion, and other stereotypes. We evaluate various machine learning approaches to establish baselines and fine-tune language models of different architectures and sizes. We employ explainable AI (XAI) tools, including SHAP, LIME, and BertViz, to assess whether the model's learned patterns align with human intuitions about stereotypes.
Score: 4.908389661988191
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Stereotype detection is a challenging and subjective task, as certain statements, such as "Black people like to play basketball," may not appear overtly toxic but still reinforce racial stereotypes. With the increasing prevalence of large language models (LLMs) in human-facing artificial intelligence (AI) applications, detecting these types of biases is essential. However, LLMs risk perpetuating and amplifying stereotypical outputs derived from their training data. A reliable stereotype detector is crucial for benchmarking bias, monitoring model input and output, filtering training data, and ensuring fairer model behavior in downstream applications. This paper introduces the Multi-Grain Stereotype (MGS) dataset, consisting of 51,867 instances across gender, race, profession, religion, and other stereotypes, curated from multiple existing datasets. We evaluate various machine learning approaches to establish baselines and fine-tune language models of different architectures and sizes, presenting a suite of stereotype multiclass classifiers trained on the MGS dataset. Given the subjectivity of stereotypes, explainability is essential to align model learning with human understanding of stereotypes. We employ explainable AI (XAI) tools, including SHAP, LIME, and BertViz, to assess whether the model's learned patterns align with human intuitions about stereotypes.Additionally, we develop stereotype elicitation prompts and benchmark the presence of stereotypes in text generation tasks using popular LLMs, employing the best-performing stereotype classifiers.

Related papers

Addressing Stereotypes in Large Language Models: A Critical Examination and Mitigation [0.0]
Large Language models (LLMs) have gained popularity in recent years with the advancement of Natural Language Processing (NLP)<n>This study inspects and highlights the need to address biases in LLMs amid growing generative Artificial Intelligence (AI)<n>We utilize bias-specific benchmarks such StereoSet and CrowSPairs to evaluate the existence of various biases in many different generative models such as BERT, GPT 3.5, and ADA.
arXiv Detail & Related papers (2025-11-18T05:43:34Z)
Stereotype Detection as a Catalyst for Enhanced Bias Detection: A Multi-Task Learning Approach [36.64093052736432]
Bias and stereotypes in language models can cause harm, especially in sensitive areas like content moderation and decision-making.<n>This paper addresses bias and stereotype detection by exploring how jointly learning these tasks enhances model performance.<n>We introduce StereoBias, a unique dataset labeled for bias and stereotype detection across five categories: religion, gender, socio-economic status, race, profession, and others.
arXiv Detail & Related papers (2025-07-02T13:46:00Z)
Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings [41.09752906121257]
We propose a four-tuple definition and provide precise terminology distinguishing stereotype, anti-stereotype, stereotypical bias, and bias. We demonstrate that language models for reasoning with fewer than 10B parameters often get confused when detecting anti-stereotypes.
arXiv Detail & Related papers (2025-04-04T11:14:38Z)
Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models [50.40276881893513]
This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in Speech Large Language Models (SLLMs) By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases. The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
arXiv Detail & Related papers (2024-08-14T16:55:06Z)
Who is better at math, Jenny or Jingzhen? Uncovering Stereotypes in Large Language Models [9.734705470760511]
We use GlobalBias to study a broad set of stereotypes from around the world. We generate character profiles based on given names and evaluate the prevalence of stereotypes in model outputs.
arXiv Detail & Related papers (2024-07-09T14:52:52Z)
Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping. We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups. We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z)
Towards Auditing Large Language Models: Improving Text-based Stereotype Detection [5.3634450268516565]
This work introduces i) the Multi-Grain Stereotype dataset, which includes 52,751 instances of gender, race, profession and religion stereotypic text. We design several experiments to rigorously test the proposed model trained on the novel dataset. Experiments show that training the model in a multi-class setting can outperform the one-vs-all binary counterpart.
arXiv Detail & Related papers (2023-11-23T17:47:14Z)
Will the Prince Get True Love's Kiss? On the Model Sensitivity to Gender Perturbation over Fairytale Texts [80.21033860436081]
We investigate how models respond to gender stereotype perturbations through counterfactual data augmentation. Our results show that models exhibit slight performance drops when faced with gender perturbations in the test set. When fine-tuned on counterfactual training data, models become more robust to anti-stereotypical narratives.
arXiv Detail & Related papers (2023-10-16T22:25:09Z)
Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding. We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
arXiv Detail & Related papers (2023-01-31T20:09:33Z)
Counteracts: Testing Stereotypical Representation in Pre-trained Language Models [4.211128681972148]
We use counterexamples to examine the internal stereotypical knowledge in pre-trained language models (PLMs) We evaluate 7 PLMs on 9 types of cloze-style prompt with different information and base knowledge.
arXiv Detail & Related papers (2023-01-11T07:52:59Z)
Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale [61.555788332182395]
We investigate the potential for machine learning models to amplify dangerous and complex stereotypes. We find a broad range of ordinary prompts produce stereotypes, including prompts simply mentioning traits, descriptors, occupations, or objects.
arXiv Detail & Related papers (2022-11-07T18:31:07Z)
Towards Understanding and Mitigating Social Biases in Language Models [107.82654101403264]
Large-scale pretrained language models (LMs) can be potentially dangerous in manifesting undesirable representational biases. We propose steps towards mitigating social biases during text generation. Our empirical results and human evaluation demonstrate effectiveness in mitigating bias while retaining crucial contextual information.
arXiv Detail & Related papers (2021-06-24T17:52:43Z)
UnQovering Stereotyping Biases via Underspecified Questions [68.81749777034409]
We present UNQOVER, a framework to probe and quantify biases through underspecified questions. We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors. We use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion.
arXiv Detail & Related papers (2020-10-06T01:49:52Z)
CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models [30.582132471411263]
We introduce the Crowd Stereotype Pairs benchmark (CrowS-Pairs) CrowS-Pairs has 1508 examples that cover stereotypes dealing with nine types of bias, like race, religion, and age. We find that all three of the widely-used sentences we evaluate substantially favor stereotypes in every category in CrowS-Pairs.
arXiv Detail & Related papers (2020-09-30T22:38:40Z)
StereoSet: Measuring stereotypical bias in pretrained language models [24.020149562072127]
We present StereoSet, a large-scale natural dataset in English to measure stereotypical biases in four domains. We evaluate popular models like BERT, GPT-2, RoBERTa, and XLNet on our dataset and show that these models exhibit strong stereotypical biases.
arXiv Detail & Related papers (2020-04-20T17:14:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.