Auditing Large Language Models for Enhanced Text-Based Stereotype Detection and Probing-Based Bias Evaluation
- URL: http://arxiv.org/abs/2404.01768v1
- Date: Tue, 2 Apr 2024 09:31:32 GMT
- Title: Auditing Large Language Models for Enhanced Text-Based Stereotype Detection and Probing-Based Bias Evaluation
- Authors: Zekun Wu, Sahan Bulathwela, Maria Perez-Ortiz, Adriano Soares Koshiyama
- Abstract summary: This work introduces the Multi-Grain Stereotype dataset, encompassing 51,867 instances of gender, race, profession, and religion stereotypical text.
We explore different machine learning approaches aimed at establishing baselines for stereotype detection.
We develop a series of stereotype elicitation prompts and evaluate the presence of stereotypes in text generation tasks with popular Large Language Models.
- Score: 4.908389661988191
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in Large Language Models (LLMs) have significantly increased their presence in human-facing Artificial Intelligence (AI) applications. However, LLMs could reproduce and even exacerbate stereotypical outputs from training data. This work introduces the Multi-Grain Stereotype (MGS) dataset, encompassing 51,867 instances of gender, race, profession, and religion stereotypical text, collected by fusing multiple previously publicly available stereotype detection datasets. We explore different machine learning approaches aimed at establishing baselines for stereotype detection, and fine-tune several language models of various architectures and model sizes, presenting a series of stereotype classifier models for English text trained on MGS. To understand whether our stereotype detectors capture relevant features (aligning with human common sense), we utilise a variety of explainable AI tools, including SHAP, LIME, and BertViz, and analyse a series of example cases, discussing the results. Finally, we develop a series of stereotype elicitation prompts and evaluate the presence of stereotypes in text generation tasks with popular LLMs, using one of the best-performing stereotype detectors presented earlier in this work. Our experiments yielded several key findings: i) Training stereotype detectors in a multi-dimension setting yields better results than training multiple single-dimension classifiers. ii) The integrated MGS dataset enhances both the in-dataset and cross-dataset generalisation ability of stereotype detectors compared to using the datasets separately. iii) Newer versions of GPT-family LLMs show a reduction in stereotypes in their generated content.
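For a concrete picture of the detector-training step described above, here is a minimal sketch of fine-tuning a transformer as a multi-dimension stereotype classifier. The label set, toy training rows, and the distilbert-base-uncased checkpoint are assumptions for illustration, not the paper's configuration or the MGS data.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed label set for the multi-dimension setting (one classifier over all
# stereotype dimensions plus a non-stereotype class); the real MGS labels may differ.
LABELS = ["unrelated", "gender", "race", "profession", "religion"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(LABELS)
)

# Toy stand-ins for MGS rows: (text, label index) pairs.
train_rows = [
    ("Nurses are always women.", LABELS.index("gender")),
    ("The committee approved the budget.", LABELS.index("unrelated")),
]

def collate(batch):
    texts, labels = zip(*batch)
    enc = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(train_rows, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in loader:                 # a single toy epoch
    loss = model(**batch).loss       # cross-entropy over the label set
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

A detector trained along these lines is what the probing stage then runs over LLM completions of the elicitation prompts to estimate how much stereotypical text each model produces.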
Related papers
- Stereotype Detection as a Catalyst for Enhanced Bias Detection: A Multi-Task Learning Approach [36.64093052736432]
Bias and stereotypes in language models can cause harm, especially in sensitive areas like content moderation and decision-making.
This paper addresses bias and stereotype detection by exploring how jointly learning these tasks enhances model performance.
We introduce StereoBias, a unique dataset labeled for bias and stereotype detection across five categories: religion, gender, socio-economic status, race, profession, and others.
arXiv Detail & Related papers (2025-07-02T13:46:00Z) - Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings [41.09752906121257]
We propose a four-tuple definition and provide precise terminology distinguishing stereotype, anti-stereotype, stereotypical bias, and bias.
We demonstrate that reasoning language models with fewer than 10B parameters often get confused when detecting anti-stereotypes.
arXiv Detail & Related papers (2025-04-04T11:14:38Z) - Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models [50.40276881893513]
This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in Speech Large Language Models (SLLMs).
By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases.
The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
arXiv Detail & Related papers (2024-08-14T16:55:06Z) - Who is better at math, Jenny or Jingzhen? Uncovering Stereotypes in Large Language Models [9.734705470760511]
We use GlobalBias to study a broad set of stereotypes from around the world.
We generate character profiles based on given names and evaluate the prevalence of stereotypes in model outputs.
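A rough sketch of this kind of name-conditioned probing, assuming a generic prompt, a small open checkpoint (gpt2), and a toy name pair rather than the GlobalBias setup:

```python
from transformers import pipeline

# Toy name pair; GlobalBias itself covers a much larger, worldwide set of names.
names = ["Jenny", "Jingzhen"]

generator = pipeline("text-generation", model="gpt2")

for name in names:
    prompt = f"Write a short character profile of {name}, describing their job and hobbies."
    profile = generator(prompt, max_new_tokens=60)[0]["generated_text"]
    print(name, "->", profile)
    # The generated profiles would then be compared (e.g. with a stereotype
    # classifier or keyword statistics) to see whether traits track the names.
```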
arXiv Detail & Related papers (2024-07-09T14:52:52Z) - Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping.
We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups.
We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
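A minimal sketch of the recognise-then-rewrite prompting pattern this line of work relies on, assuming illustrative prompts and a small open checkpoint rather than the authors' models:

```python
from transformers import pipeline

# gpt2 stands in for the instruction-following LLMs used in practice,
# so the outputs here are only illustrative.
generator = pipeline("text-generation", model="gpt2")

draft = generator("The nurse walked into the room and she", max_new_tokens=30)[0]["generated_text"]

# Step 1: zero-shot recognition -- ask the model itself whether the draft relies on a stereotype.
recognition = generator(
    f"Text: {draft}\nDoes this text rely on a social stereotype? Answer yes or no:",
    max_new_tokens=5,
)[0]["generated_text"]

# Step 2: zero-shot reduction -- ask for a rewrite that avoids the stereotype.
debiased = generator(
    f"Rewrite the following text so it does not rely on any social stereotype:\n{draft}\nRewrite:",
    max_new_tokens=40,
)[0]["generated_text"]

print(recognition)
print(debiased)
```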
arXiv Detail & Related papers (2024-02-03T01:40:11Z) - Towards Auditing Large Language Models: Improving Text-based Stereotype Detection [5.3634450268516565]
This work introduces the Multi-Grain Stereotype dataset, which includes 52,751 instances of gender, race, profession, and religion stereotypic text.
We design several experiments to rigorously test the proposed model trained on the novel dataset.
Experiments show that training the model in a multi-class setting can outperform the one-vs-all binary counterpart.
arXiv Detail & Related papers (2023-11-23T17:47:14Z) - Will the Prince Get True Love's Kiss? On the Model Sensitivity to Gender Perturbation over Fairytale Texts [80.21033860436081]
We investigate how models respond to gender stereotype perturbations through counterfactual data augmentation.
Our results show that models exhibit slight performance drops when faced with gender perturbations in the test set.
When fine-tuned on counterfactual training data, models become more robust to anti-stereotypical narratives.
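A toy sketch of this kind of counterfactual gender perturbation, assuming a hand-written word list rather than the paper's augmentation pipeline:

```python
# Assumed toy lexicon; a real augmentation pipeline would use a fuller word list
# and handle ambiguous cases (e.g. possessive vs. object "her") more carefully.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
         "prince": "princess", "princess": "prince", "king": "queen", "queen": "king"}

def gender_perturb(text: str) -> str:
    out = []
    for tok in text.split():
        stripped = tok.rstrip(".,!?")
        tail = tok[len(stripped):]
        swap = SWAPS.get(stripped.lower())
        if swap is None:
            out.append(tok)
            continue
        if stripped[0].isupper():
            swap = swap.capitalize()
        out.append(swap + tail)
    return " ".join(out)

print(gender_perturb("The prince hoped she would accept his kiss."))
# -> "The princess hoped he would accept her kiss."
```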
arXiv Detail & Related papers (2023-10-16T22:25:09Z) - Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
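A minimal numpy sketch of the core idea, projecting a bias direction out of text embeddings; the paper's calibrated projection matrix is replaced here by a plain orthogonal projection, and the vectors are random placeholders rather than CLIP embeddings:

```python
import numpy as np

def remove_direction(embeddings: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project each embedding onto the subspace orthogonal to `direction`."""
    d = direction / np.linalg.norm(direction)
    projection = np.eye(d.shape[0]) - np.outer(d, d)   # P = I - d d^T
    return embeddings @ projection

# Toy stand-ins: in the vision-language setting the rows would be text embeddings
# and the direction could come from contrasting biased prompts.
rng = np.random.default_rng(0)
text_embeddings = rng.normal(size=(4, 8))
bias_direction = rng.normal(size=8)

debiased = remove_direction(text_embeddings, bias_direction)
print(np.allclose(debiased @ bias_direction, 0.0))  # True: bias component removed
```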
arXiv Detail & Related papers (2023-01-31T20:09:33Z) - Counteracts: Testing Stereotypical Representation in Pre-trained Language Models [4.211128681972148]
We use counterexamples to examine the internal stereotypical knowledge in pre-trained language models (PLMs).
We evaluate 7 PLMs on 9 types of cloze-style prompts with different information and base knowledge.
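A short sketch of what a cloze-style probe looks like in practice, assuming a single illustrative template and checkpoint rather than the paper's 7 PLMs and 9 prompt types:

```python
from transformers import pipeline

# bert-base-uncased and the template below are illustrative stand-ins only.
fill = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill("People from [MASK] are good at math.", top_k=5):
    print(f"{prediction['token_str']:>12}  {prediction['score']:.3f}")
```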
arXiv Detail & Related papers (2023-01-11T07:52:59Z) - Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale [61.555788332182395]
We investigate the potential for machine learning models to amplify dangerous and complex stereotypes.
We find that a broad range of ordinary prompts produces stereotypes, including prompts that simply mention traits, descriptors, occupations, or objects.
arXiv Detail & Related papers (2022-11-07T18:31:07Z) - Towards Understanding and Mitigating Social Biases in Language Models [107.82654101403264]
Large-scale pretrained language models (LMs) can be potentially dangerous in manifesting undesirable representational biases.
We propose steps towards mitigating social biases during text generation.
Our empirical results and human evaluation demonstrate effectiveness in mitigating bias while retaining crucial contextual information.
arXiv Detail & Related papers (2021-06-24T17:52:43Z) - UnQovering Stereotyping Biases via Underspecified Questions [68.81749777034409]
We present UNQOVER, a framework to probe and quantify biases through underspecified questions.
We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors.
We use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion.
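A bare-bones sketch of an underspecified-question probe in the spirit of UNQOVER, with an assumed template and checkpoint, and without the paper's corrections for the reasoning errors it identifies:

```python
from transformers import pipeline

# The SQuAD-tuned checkpoint and the example below are placeholders.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = "Gerald got off the flight to visit Jennifer."
question = "Who was a bad driver?"   # the context gives no evidence either way

result = qa(question=question, context=context)
print(result["answer"], round(result["score"], 3))  # a confident pick reflects model bias, not evidence
```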
arXiv Detail & Related papers (2020-10-06T01:49:52Z) - CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models [30.582132471411263]
We introduce the Crowdsourced Stereotype Pairs benchmark (CrowS-Pairs).
CrowS-Pairs has 1508 examples that cover stereotypes dealing with nine types of bias, like race, religion, and age.
We find that all three of the widely used masked language models we evaluate substantially favor sentences that express stereotypes in every category in CrowS-Pairs.
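A simplified sketch of the paired comparison CrowS-Pairs enables, scoring a minimal pair under a masked language model with a plain pseudo-log-likelihood; the paper's metric conditions only on the tokens shared by the two sentences, which this sketch omits, and the example pair is invented rather than a dataset item:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum log-probabilities of each token when it is masked out in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):            # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

more_stereotypical = "Women are bad at driving."
less_stereotypical = "Men are bad at driving."
print(pseudo_log_likelihood(more_stereotypical) > pseudo_log_likelihood(less_stereotypical))
```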
arXiv Detail & Related papers (2020-09-30T22:38:40Z) - StereoSet: Measuring stereotypical bias in pretrained language models [24.020149562072127]
We present StereoSet, a large-scale natural dataset in English to measure stereotypical biases in four domains.
We evaluate popular models like BERT, GPT-2, RoBERTa, and XLNet on our dataset and show that these models exhibit strong stereotypical biases.
arXiv Detail & Related papers (2020-04-20T17:14:33Z)