Diagnosing and Debiasing Corpus-Based Political Bias and Insults in GPT2
- URL: http://arxiv.org/abs/2311.10266v1
- Date: Fri, 17 Nov 2023 01:20:08 GMT
- Title: Diagnosing and Debiasing Corpus-Based Political Bias and Insults in GPT2
- Authors: Ambri Ma, Arnav Kumar, Brett Zeligson
- Abstract summary: Training large language models (LLMs) on extensive, unfiltered corpora sourced from the internet is a common and advantageous practice.
Recent research shows that generative pretrained transformer (GPT) language models can recognize their own biases and detect toxicity in generated content.
This study investigates the efficacy of the diagnosing-debiasing approach in mitigating two additional types of biases: insults and political bias.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The training of large language models (LLMs) on extensive, unfiltered corpora
sourced from the internet is a common and advantageous practice. Consequently,
LLMs have learned and inadvertently reproduced various types of biases,
including violent, offensive, and toxic language. However, recent research
shows that generative pretrained transformer (GPT) language models can
recognize their own biases and detect toxicity in generated content, a process
referred to as self-diagnosis. In response, researchers have developed a
decoding algorithm that allows LLMs to self-debias, or reduce their likelihood
of generating harmful text. This study investigates the efficacy of the
diagnosing-debiasing approach in mitigating two additional types of biases:
insults and political bias. These biases are often used interchangeably in
discourse, despite exhibiting potentially dissimilar semantic and syntactic
properties. We aim to contribute to the ongoing effort of investigating the
ethical and social implications of human-AI interaction.
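As a concrete illustration of the self-diagnosis step described above, the sketch below probes GPT-2 with a yes/no question about a piece of generated text and compares the next-token probabilities of "Yes" and "No". This is a minimal sketch only: the prompt template, the attribute descriptions ("an insult", "political bias"), and the scoring formula are assumptions in the spirit of the self-diagnosis setup, not the paper's exact implementation.

```python
# Minimal self-diagnosis sketch for GPT-2 (assumed prompt template and scoring).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def self_diagnose(text: str, attribute: str) -> float:
    """Return P(Yes) / (P(Yes) + P(No)) for the diagnosis question."""
    prompt = f'"{text}"\nQuestion: Does the above text contain {attribute}?\nAnswer:'
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]  # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer.encode(" Yes")[0]
    no_id = tokenizer.encode(" No")[0]
    return (probs[yes_id] / (probs[yes_id] + probs[no_id])).item()

# Probe one continuation for the two bias types studied in this paper.
sample = "You are a worthless idiot."
print("insult:", round(self_diagnose(sample, "an insult"), 3))
print("political bias:", round(self_diagnose(sample, "political bias"), 3))
```

The self-debiasing decoding algorithm mentioned in the abstract builds on the same idea, roughly by down-weighting tokens that become more likely when the model is conditioned on a description of the undesired attribute; that step is omitted from this sketch.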
Related papers
- Towards Transfer Unlearning: Empirical Evidence of Cross-Domain Bias Mitigation [18.150899267807965]
We study an unlearning-based approach to debiasing in large language models (LLMs).
We propose a masked language modeling unlearning technique, which unlearns the harmful part of the text.
Experimental results demonstrate the effectiveness of our approach in diminishing bias while maintaining the language modeling abilities.
arXiv Detail & Related papers (2024-07-24T02:37:42Z)
- The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models [78.69526166193236]
Pre-trained Language Models (PLMs) have been acknowledged to contain harmful information, such as social biases.
We propose Social Bias Neurons to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias.
As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability with low cost.
arXiv Detail & Related papers (2024-06-14T15:41:06Z)
- A survey of recent methods for addressing AI fairness and bias in biomedicine [48.46929081146017]
Artificial intelligence systems may perpetuate social inequities or demonstrate biases, such as those based on race or gender.
We surveyed recent publications on different debiasing methods in the fields of biomedical natural language processing (NLP) or computer vision (CV).
We performed a literature search on PubMed, ACM digital library, and IEEE Xplore of relevant articles published between January 2018 and December 2023 using multiple combinations of keywords.
We reviewed other potential methods from the general domain that could be applied to biomedicine to address bias and improve fairness.
arXiv Detail & Related papers (2024-02-13T06:38:46Z)
- Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can at least in part be attributed to confounding and mediating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
- On Bias and Fairness in NLP: Investigating the Impact of Bias and Debiasing in Language Models on the Fairness of Toxicity Detection [7.297345802761503]
Representation bias, selection bias, and overamplification bias are investigated.
We show that overamplification bias is the most impactful type of bias on the fairness of the task of toxicity detection.
We introduce a list of guidelines to ensure the fairness of the task of toxicity detection.
arXiv Detail & Related papers (2023-05-22T08:44:00Z)
- Towards an Enhanced Understanding of Bias in Pre-trained Neural Language Models: A Survey with Special Emphasis on Affective Bias [2.6304695993930594]
We present a survey to comprehend bias in large pre-trained language models, analyze the stages at which they occur, and various ways in which these biases could be quantified and mitigated.
Considering wide applicability of textual affective computing based downstream tasks in real-world systems such as business, healthcare, education, etc., we give a special emphasis on investigating bias in the context of affect (emotion) i.e., Affective Bias.
We present a summary of various bias evaluation corpora that aid future research and discuss challenges in the research on bias in pre-trained language models.
arXiv Detail & Related papers (2022-04-21T18:51:19Z)
- Identification of Bias Against People with Disabilities in Sentiment Analysis and Toxicity Detection Models [0.5758109624133713]
We present the Bias Identification Test in Sentiments (BITS), a corpus of 1,126 sentences designed to probe sentiment analysis models for biases in disability.
Results show that all tested models exhibit strong negative biases on sentences that mention disability.
arXiv Detail & Related papers (2021-11-25T21:44:18Z)
- Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields lower false positive rate in both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- Towards Controllable Biases in Language Generation [87.89632038677912]
We develop a method to induce societal biases in generated text when input prompts contain mentions of specific demographic groups.
We analyze two scenarios: 1) inducing negative biases for one demographic and positive biases for another demographic, and 2) equalizing biases between demographics.
arXiv Detail & Related papers (2020-05-01T08:25:11Z)