Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts
- URL: http://arxiv.org/abs/2212.10543v2
- Date: Fri, 26 May 2023 20:26:06 GMT
- Title: Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts
- Authors: Skyler Hallinan, Alisa Liu, Yejin Choi, Maarten Sap
- Abstract summary: We introduce MaRCo, a detoxification algorithm that combines controllable generation and text rewriting methods.
MaRCo uses likelihoods under a non-toxic LM and a toxic LM to find candidate words to mask and potentially replace.
We evaluate our method on several subtle toxicity and microaggressions datasets, and show that it not only outperforms baselines on automatic metrics, but that MaRCo's rewrites are also preferred 2.1 $\times$ more in human evaluation.
- Score: 57.38912708076231
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text detoxification has the potential to mitigate the harms of toxicity by
rephrasing text to remove offensive meaning, but subtle toxicity remains
challenging to tackle. We introduce MaRCo, a detoxification algorithm that
combines controllable generation and text rewriting methods using a Product of
Experts with autoencoder language models (LMs). MaRCo uses likelihoods under a
non-toxic LM (expert) and a toxic LM (anti-expert) to find candidate words to
mask and potentially replace. We evaluate our method on several subtle toxicity
and microaggressions datasets, and show that it not only outperforms baselines
on automatic metrics, but that MaRCo's rewrites are also preferred 2.1 $\times$ more in
human evaluation. Its applicability to instances of subtle toxicity is
especially promising, demonstrating a path forward for addressing increasingly
elusive online hate.
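To make the mask-finding step concrete, here is a minimal sketch of how expert/anti-expert likelihoods can flag candidate tokens. It is an illustration under stated assumptions, not the released implementation: MaRCo itself uses autoencoder LMs (BART-style) and a Product of Experts for the rewrite, while this sketch scores tokens with two causal LMs, uses base GPT-2 checkpoints as stand-ins for the fine-tuned expert and anti-expert, and treats the threshold as a free hyperparameter.
```python
# Sketch: flag tokens that a toxic anti-expert LM scores much higher than a
# non-toxic expert LM. Checkpoints and threshold are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
expert = AutoModelForCausalLM.from_pretrained("gpt2")      # stand-in: LM fine-tuned on non-toxic text
antiexpert = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in: LM fine-tuned on toxic text

def token_logprobs(model, ids):
    """Log-probability of each token given its left context."""
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    return logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)

def mask_candidates(text, threshold=1.0):
    """Return tokens the anti-expert prefers by more than `threshold` nats."""
    ids = tok(text, return_tensors="pt").input_ids
    gap = token_logprobs(antiexpert, ids) - token_logprobs(expert, ids)
    return [tok.decode(ids[0, i + 1]) for i in range(gap.shape[1]) if gap[0, i] > threshold]
```
Tokens flagged this way are the ones MaRCo would mask and then infill under the expert-guided ensemble.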
Related papers
- Toxic Subword Pruning for Dialogue Response Generation on Large Language Models [51.713448010799986]
We propose Toxic Subword Pruning (ToxPrune) to prune subwords contained in toxic words from the BPE vocabulary of trained LLMs.
ToxPrune clearly improves the toxic language model NSFW-3B on dialogue response generation.
arXiv Detail & Related papers (2024-10-05T13:30:33Z)
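From the summary above, the core of ToxPrune is removing toxic subwords from the usable BPE vocabulary. A minimal sketch of one way to realize that at decoding time, assuming a Hugging Face-style tokenizer and a toy hand-picked lexicon (both illustrative; the paper prunes from a trained LLM's BPE directly):
```python
# Sketch: ban every BPE subword that occurs in a toxic word list by masking
# its logit at each decoding step. Lexicon and masking route are illustrative.
import torch
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
toxic_words = ["idiot", "moron"]  # toy placeholder lexicon

banned_ids = set()
for word in toxic_words:
    for variant in (word, " " + word):  # GPT-2 BPE encodes word-initial spaces
        banned_ids.update(tok(variant, add_special_tokens=False).input_ids)

def prune_logits(logits: torch.Tensor) -> torch.Tensor:
    """Suppress banned subwords before sampling the next token."""
    logits[..., list(banned_ids)] = float("-inf")
    return logits
```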
- Mitigating Text Toxicity with Counterfactual Generation [0.3250512744763586]
Toxicity mitigation consists of rephrasing text to remove harmful meaning.
Current methods fail to detoxify text while preserving the initial non-toxic meaning.
This work is the first to bridge the gap between counterfactual generation and text detoxification.
arXiv Detail & Related papers (2024-05-16T09:52:21Z)
- Detoxifying Large Language Models via Knowledge Editing [57.0669577257301]
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs).
We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts.
We conduct experiments with several knowledge editing approaches, indicating that knowledge editing can efficiently detoxify LLMs with limited impact on general performance.
arXiv Detail & Related papers (2024-03-21T15:18:30Z)
- Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models [26.474136481185724]
We propose fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without additional cost.
FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt.
We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels.
arXiv Detail & Related papers (2024-02-23T09:04:48Z)
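FGDILP's contrast happens in attention space; as a loose, logit-level analogue (an assumption for illustration, not the paper's method), one can compare the next-token distribution under a positivity-instructing prefix against the raw prompt:
```python
# Sketch: steer next-token logits toward what a positively-prefixed prompt
# prefers. Prefix wording, combination rule, and alpha are all assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
POSITIVE_PREFIX = "Respond politely and without toxicity: "  # hypothetical prefix

def prefix_contrast_logits(prompt: str, alpha: float = 0.5) -> torch.Tensor:
    plain = tok(prompt, return_tensors="pt").input_ids
    prefixed = tok(POSITIVE_PREFIX + prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        l_plain = lm(plain).logits[:, -1]        # next-token logits, raw prompt
        l_prefixed = lm(prefixed).logits[:, -1]  # next-token logits, prefixed prompt
    # Amplify the direction in which the positive prefix shifts the distribution.
    return l_prefixed + alpha * (l_prefixed - l_plain)
```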
- Parameter-Efficient Detoxification with Contrastive Decoding [78.5124331048714]
We introduce Detoxification Generator (DETOXIGEN), an inference-time algorithm that steers the generation away from unwanted styles.
During the actual generation, we use the trained detoxifier to produce undesirable tokens for the generator to contrast against at each decoding step.
We find that it significantly outperforms previous approaches on detoxification metrics without compromising generation quality.
arXiv Detail & Related papers (2024-01-13T01:46:20Z)
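The per-step contrast in DETOXIGEN reads like standard contrastive decoding, sketched below. The detoxifier checkpoint here is a stand-in (the paper trains a dedicated one), and the subtract-scaled-log-probs rule is a common contrastive form rather than necessarily the paper's exact formula:
```python
# Sketch: contrast a generator's next-token distribution against a detoxifier
# LM trained to emit toxic text. Checkpoints and alpha are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
generator = AutoModelForCausalLM.from_pretrained("gpt2")
detoxifier = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the trained detoxifier

def detoxified_step(ids: torch.Tensor, alpha: float = 0.6) -> int:
    """One greedy decoding step penalizing tokens the detoxifier would emit."""
    with torch.no_grad():
        g = generator(ids).logits[:, -1]
        d = detoxifier(ids).logits[:, -1]
    scores = torch.log_softmax(g, dim=-1) - alpha * torch.log_softmax(d, dim=-1)
    return int(scores.argmax(dim=-1))
```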
- Comprehensive Assessment of Toxicity in ChatGPT [49.71090497696024]
We evaluate toxicity in ChatGPT using instruction-tuning datasets.
Prompts in creative writing tasks can be 2x more likely to elicit toxic responses.
Certain deliberately toxic prompts, designed in earlier studies, no longer yield harmful responses.
arXiv Detail & Related papers (2023-11-03T14:37:53Z)
- Challenges in Detoxifying Language Models [44.48396735574315]
Large language models (LMs) generate remarkably fluent text and can be efficiently adapted across NLP tasks.
Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world.
We evaluate several toxicity mitigation strategies with respect to both automatic and human evaluation.
arXiv Detail & Related papers (2021-09-15T17:27:06Z)
- Detoxifying Language Models Risks Marginalizing Minority Voices [40.918564746367586]
Language models (LMs) must be both safe and equitable to be responsibly deployed in practice.
Detoxification techniques have been proposed to mitigate toxic LM generations.
We show that current detoxification techniques hurt equity: they decrease the utility of LMs on language used by marginalized groups.
arXiv Detail & Related papers (2021-04-13T17:52:01Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)