Detoxifying Language Models Risks Marginalizing Minority Voices
- URL: http://arxiv.org/abs/2104.06390v1
- Date: Tue, 13 Apr 2021 17:52:01 GMT
- Title: Detoxifying Language Models Risks Marginalizing Minority Voices
- Authors: Albert Xu, Eshaan Pathak, Eric Wallace, Suchin Gururangan, Maarten
Sap, Dan Klein
- Abstract summary: Language models (LMs) must be both safe and equitable to be responsibly deployed in practice.
Numerous detoxification techniques have been proposed to mitigate toxic LM generations.
We show that current detoxification techniques hurt equity: they decrease the utility of LMs on language used by marginalized groups.
- Score: 40.918564746367586
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models (LMs) must be both safe and equitable to be responsibly
deployed in practice. With safety in mind, numerous detoxification techniques
(e.g., Dathathri et al. 2020; Krause et al. 2020) have been proposed to
mitigate toxic LM generations. In this work, we show that current
detoxification techniques hurt equity: they decrease the utility of LMs on
language used by marginalized groups (e.g., African-American English and
minority identity mentions). In particular, we perform automatic and human
evaluations of text generation quality when LMs are conditioned on inputs with
different dialects and group identifiers. We find that detoxification makes LMs
more brittle to distribution shift, especially on language used by marginalized
groups. We identify that these failures stem from detoxification methods
exploiting spurious correlations in toxicity datasets. Overall, our results
highlight the tension between the controllability and distributional robustness
of LMs.
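As a concrete illustration of the automatic evaluation described in the abstract, the sketch below compares the perplexity of a baseline LM and a detoxified LM on prompts written in different dialects; a larger perplexity increase on one dialect indicates a larger utility loss for its speakers. This is a minimal sketch rather than the authors' released code: the detoxified checkpoint path and the prompt lists are placeholders.
```python
# Minimal sketch (not the authors' code): probe whether detoxification hurts
# utility more on one dialect than another by comparing perplexities.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_perplexity(model, tokenizer, texts):
    """Average per-text perplexity; higher means the LM models the text worse."""
    model.eval()
    ppls = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        ppls.append(torch.exp(out.loss).item())
    return sum(ppls) / len(ppls)

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")
# Hypothetical checkpoint fine-tuned on "non-toxic" data (a DAPT-style detoxified LM).
detox = AutoModelForCausalLM.from_pretrained("path/to/detoxified-gpt2")

dialect_prompts = {
    "WAE": ["I was walking to the store when it started raining."],     # placeholder
    "AAE": ["I was finna head to the store when it started raining."],  # placeholder
}
for dialect, prompts in dialect_prompts.items():
    gap = mean_perplexity(detox, tok, prompts) - mean_perplexity(base, tok, prompts)
    print(f"{dialect}: perplexity increase after detoxification = {gap:.2f}")
```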
Related papers
- Mitigating Text Toxicity with Counterfactual Generation [0.3250512744763586]
Toxicity mitigation consists of rephrasing text to remove its harmful meaning.
Current methods fail to detoxify text while preserving the initial non-toxic meaning.
This work is the first to bridge the gap between counterfactual generation and text detoxification.
arXiv Detail & Related papers (2024-05-16T09:52:21Z)
- PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models [27.996123856250065]
Existing toxicity benchmarks are overwhelmingly focused on English.
We introduce PolygloToxicityPrompts (PTP), the first large-scale multilingual toxicity evaluation benchmark of 425K naturally occurring prompts spanning 17 languages.
arXiv Detail & Related papers (2024-05-15T14:22:33Z)
- Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts [57.38912708076231]
We introduce MaRCo, a detoxification algorithm that combines controllable generation and text rewriting methods.
MaRCo uses likelihoods under a non-toxic LM and a toxic LM to find candidate words to mask and potentially replace.
We evaluate our method on several subtle toxicity and microaggressions datasets, and show that it not only outperforms baselines on automatic metrics but that its rewrites are also preferred 2.1 times more often in human evaluation.
arXiv Detail & Related papers (2022-12-20T18:50:00Z)
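The expert/anti-expert contrast described in the MaRCo entry above can be illustrated as a per-token likelihood comparison: tokens that a toxic LM scores much higher than a non-toxic LM become candidates for masking and rewriting. The sketch below is only an illustration of that idea, not MaRCo's released implementation; both expert checkpoint names are placeholders.
```python
# Illustrative sketch of expert/anti-expert masking (not MaRCo's implementation):
# flag tokens that the toxic LM likes much better than the non-toxic LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_logprobs(model, input_ids):
    """Log-probability the model assigns to each token given its left context."""
    with torch.no_grad():
        logits = model(input_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    return logprobs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

tok = AutoTokenizer.from_pretrained("gpt2")
expert = AutoModelForCausalLM.from_pretrained("path/to/nontoxic-expert")         # placeholder
anti_expert = AutoModelForCausalLM.from_pretrained("path/to/toxic-anti-expert")  # placeholder

text = "an input sentence to revise"
ids = tok(text, return_tensors="pt").input_ids
disagreement = token_logprobs(anti_expert, ids) - token_logprobs(expert, ids)

# Tokens whose likelihood gap exceeds a threshold become masking candidates,
# which a rewriting model could then infill with less toxic alternatives.
THRESHOLD = 2.0
candidates = [tok.decode(ids[0, i + 1].item())
              for i, gap in enumerate(disagreement[0]) if gap > THRESHOLD]
print("mask candidates:", candidates)
```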
- Language Detoxification with Attribute-Discriminative Latent Space [59.167432249229584]
Transformer-based Language Models (LMs) have achieved impressive results on natural language understanding tasks.
They can also generate toxic text such as insults, threats, and profanity, limiting their real-world applications.
We propose an effective yet efficient method for language detoxification using an attribute-discriminative latent space.
arXiv Detail & Related papers (2022-10-19T06:54:42Z)
- Challenges in Detoxifying Language Models [44.48396735574315]
Large language models (LMs) generate remarkably fluent text and can be efficiently adapted across NLP tasks.
Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world.
We evaluate several toxicity mitigation strategies using both automatic and human evaluation.
arXiv Detail & Related papers (2021-09-15T17:27:06Z)
- Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields a lower false positive rate on both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
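The evaluation loop that RealToxicityPrompts describes can be sketched as: condition the LM on a prompt, sample several continuations, and score each for toxicity. The paper scores continuations with the Perspective API; the sketch below substitutes the open-source Detoxify classifier and uses a placeholder prompt rather than the actual benchmark.
```python
# Minimal sketch of prompted toxic-degeneration measurement (Detoxify stands in
# for the Perspective API used in the paper; prompts are placeholders).
from transformers import AutoModelForCausalLM, AutoTokenizer
from detoxify import Detoxify  # pip install detoxify

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
scorer = Detoxify("original")

prompts = ["I cannot believe they would say something so"]  # placeholder prompt
for prompt in prompts:
    ids = tok(prompt, return_tensors="pt").input_ids
    outputs = lm.generate(ids, do_sample=True, max_new_tokens=20,
                          num_return_sequences=5, pad_token_id=tok.eos_token_id)
    continuations = [tok.decode(o[ids.shape[1]:], skip_special_tokens=True) for o in outputs]
    scores = [scorer.predict(c)["toxicity"] for c in continuations]
    # "Expected maximum toxicity" over k samples is one of the paper's metrics.
    print(f"{prompt!r}: max toxicity over {len(scores)} samples = {max(scores):.3f}")
```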