Related papers: Mitigating Text Toxicity with Counterfactual Generation

URL: http://arxiv.org/abs/2405.09948v2
Date: Tue, 6 Aug 2024 10:41:25 GMT
Title: Mitigating Text Toxicity with Counterfactual Generation
Authors: Milan Bhan, Jean-Noel Vittaut, Nina Achache, Victor Legrand, Nicolas Chesneau, Annabelle Blangero, Juliette Murris, Marie-Jeanne Lesot
Abstract summary: Toxicity mitigation consists in rephrasing text in order to remove harmful meaning. Current methods fail to detoxify text while preserving the initial non-toxic meaning. This work is the first to bridge the gap between counterfactual generation and text detoxification.
Score: 0.3250512744763586
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Toxicity mitigation consists in rephrasing text in order to remove offensive or harmful meaning. Neural natural language processing (NLP) models have been widely used to target and mitigate textual toxicity. However, existing methods fail to detoxify text while preserving the initial non-toxic meaning at the same time. In this work, we propose to apply counterfactual generation methods from the eXplainable AI (XAI) field to target and mitigate textual toxicity. In particular, we perform text detoxification by applying local feature importance and counterfactual generation methods to a toxicity classifier distinguishing between toxic and non-toxic texts. We carry out text detoxification through counterfactual generation on three datasets and compare our approach to three competitors. Automatic and human evaluations show that recently developed NLP counterfactual generators can mitigate toxicity accurately while better preserving the meaning of the initial text as compared to classical detoxification methods. Finally, we take a step back from using automated detoxification tools, and discuss how to manage the polysemous nature of toxicity and the risk of malicious use of detoxification tools. This work is the first to bridge the gap between counterfactual generation and text detoxification and paves the way towards more practical application of XAI methods.
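The abstract's pipeline (local feature importance on a toxicity classifier, then counterfactual replacement of the most important tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicon `TOXIC_WEIGHTS` is a hypothetical stand-in for a trained toxicity classifier, the table `REPLACEMENTS` is a hypothetical stand-in for a neural counterfactual generator, and occlusion is only one of several possible importance methods.

```python
from typing import Callable, Dict, List

# Hypothetical toy lexicon standing in for a trained toxicity classifier.
TOXIC_WEIGHTS: Dict[str, float] = {"stupid": 0.6, "idiot": 0.8, "hate": 0.5}

# Hypothetical replacement table standing in for a counterfactual generator.
REPLACEMENTS: Dict[str, str] = {
    "stupid": "misguided",
    "idiot": "person",
    "hate": "dislike",
}


def toxicity_score(tokens: List[str]) -> float:
    """Toy classifier: unnormalised sum of lexicon weights."""
    return sum(TOXIC_WEIGHTS.get(t.lower(), 0.0) for t in tokens)


def occlusion_importance(
    tokens: List[str], score_fn: Callable[[List[str]], float]
) -> List[float]:
    """Local feature importance: score drop when each token is removed."""
    base = score_fn(tokens)
    return [
        base - score_fn(tokens[:i] + tokens[i + 1:])
        for i in range(len(tokens))
    ]


def detoxify(
    text: str,
    threshold: float = 0.5,
    score_fn: Callable[[List[str]], float] = toxicity_score,
) -> str:
    """Replace the most important tokens until the score falls below threshold."""
    tokens = text.split()
    while score_fn(tokens) >= threshold:
        importance = occlusion_importance(tokens, score_fn)
        i = max(range(len(tokens)), key=importance.__getitem__)
        if importance[i] <= 0:
            break  # no single token explains the score; stop editing
        # Only the targeted token changes, so the rest of the text is preserved.
        tokens[i] = REPLACEMENTS.get(tokens[i].lower(), "[MASK]")
    return " ".join(tokens)


if __name__ == "__main__":
    print(detoxify("this idea is stupid"))
    print(detoxify("have a nice day"))
```

Because only tokens with high importance for the toxic prediction are edited, the remaining words are left untouched, which is the intuition behind the meaning-preservation claim in the abstract.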

Related papers

Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models [14.566005698357747]
Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms. We introduce a fully self-reflective detoxification framework that harnesses the inherent capacities of LLMs to detect and correct toxic content. Our findings underscore the potential for truly self-regulated language models, paving the way for more responsible and ethically guided text generation systems.
arXiv Detail & Related papers (2026-01-16T21:01:26Z)
<think> So let's replace this phrase with insult... </think> Lessons learned from generation of toxic texts with LLMs [60.169913160819]
This paper explores the possibility of using synthetic toxic data as an alternative to human-generated data for training models for detoxification. Experiments show that models fine-tuned on synthetic data consistently perform worse than those trained on human data. The root cause is identified as a critical lexical diversity gap: LLMs generate toxic content using a small, repetitive vocabulary of insults that fails to capture the nuances and variety of human toxicity.
arXiv Detail & Related papers (2025-09-10T07:48:24Z)
GloSS over Toxicity: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace [62.68664365246247]
This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs). We propose GloSS (Global Toxic Subspace Suppression), a lightweight, four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the parameters of FFN.
arXiv Detail & Related papers (2025-05-20T08:29:11Z)
Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models [26.474136481185724]
We propose fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without additional cost. FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt. We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels.
arXiv Detail & Related papers (2024-02-23T09:04:48Z)
Parameter-Efficient Detoxification with Contrastive Decoding [78.5124331048714]
We introduce Detoxification Generator (DETOXIGEN), an inference-time algorithm that steers the generation away from unwanted styles. During the actual generation, we use the trained detoxifier to produce undesirable tokens for the generator to contrast against at each decoding step. We find that it significantly outperforms previous approaches in detoxification metrics while not compromising on the generation quality.
arXiv Detail & Related papers (2024-01-13T01:46:20Z)
ToViLaG: Your Visual-Language Generative Model is Also An Evildoer [36.60526586838288]
Recent large-scale Visual-Language Generative Models (VLGMs) have achieved unprecedented improvement in multimodal image/text generation. These models might also generate toxic content, e.g., offensive text and pornography images, raising significant ethical risks. This work delves into the propensity for toxicity generation and susceptibility to toxic data across various VLGMs.
arXiv Detail & Related papers (2023-12-13T08:25:07Z)
Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts [57.38912708076231]
We introduce MaRCo, a detoxification algorithm that combines controllable generation and text rewriting methods. MaRCo uses likelihoods under a non-toxic LM and a toxic LM to find candidate words to mask and potentially replace. We evaluate our method on several subtle toxicity and microaggressions datasets, and show that it not only outperforms baselines on automatic metrics, but MaRCo's rewrites are preferred 2.1 times more in human evaluation.
arXiv Detail & Related papers (2022-12-20T18:50:00Z)
Language Detoxification with Attribute-Discriminative Latent Space [59.167432249229584]
Transformer-based Language Models (LMs) have achieved impressive results on natural language understanding tasks. They can also generate toxic text such as insults, threats, and profanity, limiting their real-world applications. We propose an effective yet efficient method for language detoxification using an attribute-discriminative latent space.
arXiv Detail & Related papers (2022-10-19T06:54:42Z)
ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection [33.715318646717385]
ToxiGen is a large-scale dataset of 274k toxic and benign statements about 13 minority groups. Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale. We find that 94.5% of toxic examples are labeled as hate speech by human annotators.
arXiv Detail & Related papers (2022-03-17T17:57:56Z)
Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection. We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns. Our method yields a lower false positive rate in both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language. We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection. Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.