Challenges in Automated Debiasing for Toxic Language Detection
- URL: http://arxiv.org/abs/2102.00086v1
- Date: Fri, 29 Jan 2021 22:03:17 GMT
- Title: Challenges in Automated Debiasing for Toxic Language Detection
- Authors: Xuhui Zhou, Maarten Sap, Swabha Swayamdipta, Noah A. Smith, Yejin Choi
- Abstract summary: Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
- Score: 81.04406231100323
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Biased associations have been a challenge in the development of classifiers
for detecting toxic language, hindering both fairness and accuracy. As
potential solutions, we investigate recently introduced debiasing methods for
text classification datasets and models, as applied to toxic language
detection. Our focus is on lexical (e.g., swear words, slurs, identity
mentions) and dialectal markers (specifically African American English). Our
comprehensive experiments establish that existing methods are limited in their
ability to prevent biased behavior in current toxicity detectors. We then
propose an automatic, dialect-aware data correction method, as a
proof-of-concept. Despite the use of synthetic labels, this method reduces
dialectal associations with toxicity. Overall, our findings show that debiasing
a model trained on biased toxic language data is not as effective as simply
relabeling the data to remove existing biases.
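One way to make the "biased associations" described in the abstract concrete is to compare false positive rates on non-toxic text with and without a given marker. The sketch below is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names, the whitespace tokenization, the marker list, and the toy data are all hypothetical.

```python
from typing import Dict, Sequence, Set


def false_positive_rate(gold: Sequence[int], pred: Sequence[int]) -> float:
    """FPR = FP / (FP + TN), computed over examples whose gold label is 0 (non-toxic)."""
    negatives = [p for g, p in zip(gold, pred) if g == 0]
    if not negatives:
        return float("nan")  # undefined when there are no non-toxic examples
    return sum(negatives) / len(negatives)


def fpr_by_marker(
    texts: Sequence[str],
    gold: Sequence[int],
    pred: Sequence[int],
    markers: Set[str],
) -> Dict[str, float]:
    """Compare FPR on texts containing any marker against texts containing none.

    Assumption: a marker "occurs" if it matches a lowercased, whitespace-split
    token. Real lexical or dialectal markers would need a more careful matcher.
    """
    lowered = {m.lower() for m in markers}
    with_idx, without_idx = [], []
    for i, text in enumerate(texts):
        tokens = set(text.lower().split())
        (with_idx if tokens & lowered else without_idx).append(i)

    def subset_fpr(indices):
        return false_positive_rate(
            [gold[i] for i in indices], [pred[i] for i in indices]
        )

    return {
        "with_marker": subset_fpr(with_idx),
        "without_marker": subset_fpr(without_idx),
    }


# Hypothetical toy data: 1 = toxic, 0 = non-toxic.
texts = [
    "proud gay dad here",
    "gay rights matter",
    "nice weather today",
    "i love this song",
    "you are trash",
]
gold = [0, 0, 0, 0, 1]
pred = [1, 0, 0, 0, 1]  # the first non-toxic text is wrongly flagged

rates = fpr_by_marker(texts, gold, pred, markers={"gay"})
print(rates)  # {'with_marker': 0.5, 'without_marker': 0.0}
```

A large gap between the two rates suggests that the classifier associates the marker itself with toxicity, independent of whether the text is actually toxic.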
Related papers
- Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks are proven to be vulnerable to data poisoning attacks.
It is quite beneficial and challenging to detect poisoned samples from a mixed dataset.
We propose an Iterative Filtering approach for identifying unlearnable examples (UEs).
arXiv Detail & Related papers (2024-08-15T13:26:13Z)
- Comparing Hallucination Detection Metrics for Multilingual Generation [62.97224994631494]
This paper assesses how well various factual hallucination detection metrics identify hallucinations in generated biographical summaries across languages.
We compare how well automatic metrics correlate to each other and whether they agree with human judgments of factuality.
Our analysis reveals that while the lexical metrics are ineffective, NLI-based metrics perform well, correlating with human annotations in many settings and often outperforming supervised models.
arXiv Detail & Related papers (2024-02-16T08:10:34Z)
- Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination [54.865941973768905]
We propose a novel and practical bias mitigation method, CRISPR, to eliminate bias neurons of language models in instruction-following settings.
CRISPR automatically determines biased outputs and categorizes neurons that affect the biased outputs as bias neurons using an explainability method.
Experimental results demonstrate the effectiveness of our method in mitigating biases under zero-shot instruction-following settings without losing the model's task performance and existing knowledge.
arXiv Detail & Related papers (2023-11-16T07:16:55Z)
- Toxicity Detection with Generative Prompt-based Inference [3.9741109244650823]
It is a long-known risk that language models (LMs), once trained on a corpus containing undesirable content, can manifest biases and toxicity.
In this work, we explore the generative variant of zero-shot prompt-based toxicity detection with comprehensive trials on prompt engineering.
arXiv Detail & Related papers (2022-05-24T22:44:43Z)
- Detoxifying Language Models with a Toxic Corpus [16.7345472998388]
We propose to use a toxic corpus as an additional resource to reduce toxicity.
Our results show that a toxic corpus can indeed help to substantially reduce the toxicity of the language generation process.
arXiv Detail & Related papers (2022-04-30T18:25:18Z)
- Reward Modeling for Mitigating Toxicity in Transformer-based Language Models [0.0]
Transformer-based language models are able to generate fluent text and be efficiently adapted across various natural language generation tasks.
Language models that are pretrained on large unlabeled web text corpora have been shown to suffer from toxic degeneration and social bias.
We propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models.
arXiv Detail & Related papers (2022-02-19T19:26:22Z)
- Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields a lower false positive rate on both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
- RECAST: Enabling User Recourse and Interpretability of Toxicity Detection Models with Interactive Visualization [16.35961310670002]
We present RECAST, an interactive, open-source web tool for visualizing the predictions of toxicity detection models.
We found that RECAST was highly effective at helping users reduce toxicity as detected through the model.
This opens a discussion for how toxicity detection models work and should work, and their effect on the future of online discourse.
arXiv Detail & Related papers (2021-02-08T18:37:50Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)