RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language
Models
- URL: http://arxiv.org/abs/2009.11462v2
- Date: Fri, 25 Sep 2020 20:22:26 GMT
- Title: RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language
Models
- Authors: Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A.
Smith
- Abstract summary: Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
- Score: 93.151822563361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained neural language models (LMs) are prone to generating racist,
sexist, or otherwise toxic language which hinders their safe deployment. We
investigate the extent to which pretrained LMs can be prompted to generate
toxic language, and the effectiveness of controllable text generation
algorithms at preventing such toxic degeneration. We create and release
RealToxicityPrompts, a dataset of 100K naturally occurring, sentence-level
prompts derived from a large corpus of English web text, paired with toxicity
scores from a widely-used toxicity classifier. Using RealToxicityPrompts, we
find that pretrained LMs can degenerate into toxic text even from seemingly
innocuous prompts. We empirically assess several controllable generation
methods, and find that while data- or compute-intensive methods (e.g., adaptive
pretraining on non-toxic data) are more effective at steering away from
toxicity than simpler solutions (e.g., banning "bad" words), no current method
is failsafe against neural toxic degeneration. To pinpoint the potential cause
of such persistent toxic degeneration, we analyze two web text corpora used to
pretrain several LMs (including GPT-2; Radford et al., 2019), and find a
significant amount of offensive, factually unreliable, and otherwise toxic
content. Our work provides a test bed for evaluating toxic generations by LMs
and stresses the need for better data selection processes for pretraining.
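The evaluation idea in the abstract (prompt an LM, score what it generates, and check whether even low-toxicity prompts lead to toxic continuations) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_score`, `toy_generate`, `degeneration_rate`, and the 0.5 threshold are all hypothetical stand-ins for a real toxicity classifier, a real LM sampler, and whatever cutoff an actual study would use.

```python
from typing import Callable, Dict, List, Tuple

TOXIC_THRESHOLD = 0.5  # assumed cutoff for illustration; not stated in the abstract


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a toxicity classifier: share of flagged words."""
    flagged = {"idiot", "stupid"}
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(".,!?") in flagged for w in words) / len(words)


def toy_generate(prompt: str, n: int) -> List[str]:
    """Hypothetical stand-in for sampling n continuations from an LM."""
    canned = ["and that was fine.", "you stupid idiot!", "so we went home."]
    return [canned[(len(prompt) + i) % len(canned)] for i in range(n)]


def degeneration_rate(
    prompts: List[Tuple[str, float]],
    generate: Callable[[str, int], List[str]],
    score: Callable[[str], float],
    n: int = 3,
) -> Dict[str, float]:
    """Fraction of prompts with at least one toxic continuation among n samples,
    reported separately for toxic and non-toxic prompts."""
    hits = {"toxic": 0, "non_toxic": 0}
    totals = {"toxic": 0, "non_toxic": 0}
    for text, prompt_score in prompts:
        bucket = "toxic" if prompt_score >= TOXIC_THRESHOLD else "non_toxic"
        totals[bucket] += 1
        worst = max(score(c) for c in generate(text, n))
        if worst >= TOXIC_THRESHOLD:
            hits[bucket] += 1
    # Skip buckets that received no prompts to avoid dividing by zero.
    return {b: hits[b] / totals[b] for b in totals if totals[b]}


if __name__ == "__main__":
    # Each prompt is paired with a (made-up) toxicity score, as in the dataset design.
    sample_prompts = [("The weather today is", 0.02), ("You are such a", 0.9)]
    print(degeneration_rate(sample_prompts, toy_generate, toy_score, n=3))
```

Splitting the rate by prompt toxicity is what lets a non-zero value in the `non_toxic` bucket stand for degeneration from seemingly innocuous prompts; swapping in a real generator and scorer only requires matching the two callable signatures.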
Related papers
- Toxic Subword Pruning for Dialogue Response Generation on Large Language Models [51.713448010799986]
We propose Toxic Subword Pruning (ToxPrune), which prunes the subwords contained in toxic words from BPE in trained LLMs.
ToxPrune also clearly improves the toxic language model NSFW-3B on the task of dialogue response generation.
arXiv Detail & Related papers (2024-10-05T13:30:33Z)
- Mitigating Text Toxicity with Counterfactual Generation [0.3250512744763586]
Toxicity mitigation consists in rephrasing text in order to remove harmful meaning.
Current methods fail to detoxify text while preserving the initial non-toxic meaning.
This work is the first to bridge the gap between counterfactual generation and text detoxification.
arXiv Detail & Related papers (2024-05-16T09:52:21Z)
- PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models [27.996123856250065]
Existing toxicity benchmarks are overwhelmingly focused on English.
We introduce PolygloToxicityPrompts (PTP), the first large-scale multilingual toxicity evaluation benchmark of 425K naturally occurring prompts spanning 17 languages.
arXiv Detail & Related papers (2024-05-15T14:22:33Z)
- Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models [26.474136481185724]
We propose fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without additional cost.
FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt.
We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels.
arXiv Detail & Related papers (2024-02-23T09:04:48Z)
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs) combined with their impressive capabilities may lead to new safety issues when being exploited for malicious use.
We show that, via simple zero-shot prompting, LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect.
We propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts [57.38912708076231]
We introduce MaRCo, a detoxification algorithm that combines controllable generation and text rewriting methods.
MaRCo uses likelihoods under a non-toxic LM and a toxic LM to find candidate words to mask and potentially replace.
We evaluate our method on several subtle toxicity and microaggressions datasets, and show that it not only outperforms baselines on automatic metrics, but MaRCo's rewrites are also preferred 2.1$\times$ more in human evaluation.
arXiv Detail & Related papers (2022-12-20T18:50:00Z)
- Leashing the Inner Demons: Self-Detoxification for Language Models [13.576289320208511]
Language models (LMs) can reproduce (or amplify) toxic language seen during training.
We analyze the impact of prompts, decoding strategies and training corpora on the output.
We propose a simple yet effective method for language models to "detoxify" themselves without an additional large corpus or external discriminator.
arXiv Detail & Related papers (2022-03-06T23:55:12Z)
- Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields a lower false positive rate on both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.