Reward Modeling for Mitigating Toxicity in Transformer-based Language
Models
- URL: http://arxiv.org/abs/2202.09662v2
- Date: Tue, 22 Feb 2022 20:20:33 GMT
- Title: Reward Modeling for Mitigating Toxicity in Transformer-based Language
Models
- Authors: Farshid Faal and Ketra Schmitt
- Abstract summary: Transformer-based language models generate fluent text and can be efficiently adapted across various natural language generation tasks.
Language models pretrained on large unlabeled web text corpora have been shown to produce toxic content and exhibit social biases.
We propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based language models can generate fluent text and be
efficiently adapted across various natural language generation tasks. However,
language models pretrained on large unlabeled web text corpora have been shown
to produce toxic content and exhibit social biases, which hinders their safe
deployment. Various detoxification methods have been proposed to mitigate
language model toxicity; however, these methods struggle to detoxify language
models when conditioned on prompts that contain specific social identities
related to gender, race, or religion. In this study, we propose
Reinforce-Detoxify, a reinforcement learning-based method for mitigating
toxicity in language models. We address the challenge of safety in language
models and propose a new reward model that detects toxic content and
mitigates unintended bias toward social identities in toxicity prediction.
Experiments demonstrate that Reinforce-Detoxify outperforms existing
detoxification approaches on automatic evaluation metrics and is less prone
to unintended bias toward social identities in generated content.
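The following is a minimal sketch of the general recipe behind reward-model-guided detoxification: fine-tune a policy LM with REINFORCE against a toxicity reward. The model names (gpt2, unitary/toxic-bert), the reward shaping, and the omission of a fluency-preserving KL penalty are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: REINFORCE fine-tuning of a causal LM against a toxicity reward.
# Model choices and reward shaping are assumptions for illustration.
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

policy_name = "gpt2"                # assumed policy LM
reward_name = "unitary/toxic-bert"  # assumed toxicity classifier

tok = AutoTokenizer.from_pretrained(policy_name)
policy = AutoModelForCausalLM.from_pretrained(policy_name)
rm_tok = AutoTokenizer.from_pretrained(reward_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

def toxicity_reward(text: str) -> float:
    """Reward = negative toxicity probability (assumes label 0 is 'toxic')."""
    inputs = rm_tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = reward_model(**inputs).logits
    return -torch.sigmoid(logits[0, 0]).item()

def reinforce_step(prompt: str, max_new_tokens: int = 30) -> float:
    enc = tok(prompt, return_tensors="pt")
    prompt_len = enc.input_ids.shape[1]
    # Sample a continuation from the current policy.
    out = policy.generate(**enc, do_sample=True, top_p=0.9,
                          max_new_tokens=max_new_tokens,
                          pad_token_id=tok.eos_token_id)
    continuation = out[0, prompt_len:]
    reward = toxicity_reward(tok.decode(continuation, skip_special_tokens=True))
    # Log-probability of the sampled continuation under the policy.
    logits = policy(out).logits[0, prompt_len - 1:-1]
    logp = torch.log_softmax(logits, dim=-1)
    logp_cont = logp.gather(1, continuation.unsqueeze(1)).sum()
    loss = -reward * logp_cont  # REINFORCE: push up reward-weighted log-prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```

Methods in this family usually also add a KL penalty toward the original LM so that detoxification does not degrade fluency; that term is omitted here for brevity.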
Related papers
- Recourse for reclamation: Chatting with generative language models [2.877217169371665]
We extend the concept of algorithmic recourse to generative language models.
We provide users with a novel mechanism to reach their desired prediction by dynamically setting thresholds for toxicity filtering.
A pilot study supports the potential of our proposed recourse mechanism.
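Below is a minimal sketch of what such user-adjustable threshold filtering could look like; the function names and scoring interface are hypothetical.

```python
# Sketch: user-adjustable toxicity threshold gating candidate generations.
# The scorer and threshold semantics are assumptions for illustration.
from typing import Callable, List, Optional

def filter_with_threshold(candidates: List[str],
                          toxicity_score: Callable[[str], float],
                          threshold: float) -> Optional[str]:
    """Return the first candidate whose toxicity score falls below threshold."""
    for text in candidates:
        if toxicity_score(text) < threshold:
            return text
    return None  # the user can relax the threshold and retry (the recourse step)
```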
arXiv Detail & Related papers (2024-03-21T15:14:25Z)
- DPP-Based Adversarial Prompt Searching for Language Models [56.73828162194457]
Auto-regressive Selective Replacement Ascent (ASRA) is a discrete optimization algorithm that selects prompts based on both quality and similarity using a determinantal point process (DPP).
Experimental results on six different pre-trained language models demonstrate the efficacy of ASRA for eliciting toxic content.
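The sketch below illustrates the quality-plus-diversity selection idea with a greedy determinantal point process over candidate prompts; ASRA's actual scoring and optimization details are assumptions here.

```python
# Sketch: greedy MAP selection from a quality-weighted DPP with kernel
# L = diag(quality) @ similarity @ diag(quality).
import numpy as np

def greedy_dpp_select(quality: np.ndarray, similarity: np.ndarray, k: int):
    """Greedily pick k items trading off per-item quality against similarity."""
    n = len(quality)
    L = np.outer(quality, quality) * similarity  # DPP kernel
    selected = []
    for _ in range(k):
        best, best_gain = -1, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gain = logdet if sign > 0 else -np.inf
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected
```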
arXiv Detail & Related papers (2024-03-01T05:28:06Z) - Let the Models Respond: Interpreting Language Model Detoxification
Through the Lens of Prompt Dependence [15.084940396969]
We apply popular detoxification approaches to several language models and quantify their impact on the resulting models' prompt dependence.
We evaluate the effectiveness of counter-narrative fine-tuning and compare it with reinforcement learning-driven detoxification.
arXiv Detail & Related papers (2023-09-01T22:26:06Z) - CMD: a framework for Context-aware Model self-Detoxification [22.842468869653818]
Text detoxification aims to minimize the risk of language models producing toxic content.
Existing detoxification methods fail to achieve a decent balance between detoxification effectiveness and generation quality.
We introduce a Context-aware Model self-Detoxification (CMD) framework that attends to both the context and the detoxification process.
arXiv Detail & Related papers (2023-08-16T11:50:38Z) - Language Detoxification with Attribute-Discriminative Latent Space [59.167432249229584]
Transformer-based Language Models (LMs) have achieved impressive results on natural language understanding tasks.
They can also generate toxic text such as insults, threats, and profanity, limiting their real-world applications.
We propose an effective yet efficient method for language detoxification using an attribute-discriminative latent space.
arXiv Detail & Related papers (2022-10-19T06:54:42Z) - Exploring Cross-lingual Textual Style Transfer with Large Multilingual
Language Models [78.12943085697283]
Detoxification is a task of generating text in polite style while preserving meaning and fluency of the original toxic text.
This work investigates multilingual and cross-lingual detoxification and the behavior of large multilingual models in this setting.
arXiv Detail & Related papers (2022-06-05T20:02:30Z) - Detoxifying Language Models with a Toxic Corpus [16.7345472998388]
We propose to use a toxic corpus as an additional resource for reducing toxicity.
Our results show that a toxic corpus can indeed help to substantially reduce the toxicity of the language generation process.
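One common way to turn a toxic corpus into a training signal is unlikelihood training, sketched below; whether this matches the paper's actual method is an assumption.

```python
# Sketch: unlikelihood loss that penalizes probability mass the LM places
# on next tokens drawn from a toxic corpus. This is only one way to use
# toxic data as a negative signal, not necessarily the paper's method.
import torch
import torch.nn.functional as F

def unlikelihood_loss(lm, toxic_input_ids: torch.Tensor) -> torch.Tensor:
    logits = lm(toxic_input_ids).logits[:, :-1]   # predictions for next tokens
    targets = toxic_input_ids[:, 1:]
    probs = F.softmax(logits, dim=-1)
    p_target = probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # -log(1 - p) grows as the model assigns probability to toxic tokens.
    return -torch.log((1.0 - p_target).clamp_min(1e-8)).mean()
```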
arXiv Detail & Related papers (2022-04-30T18:25:18Z) - Leashing the Inner Demons: Self-Detoxification for Language Models [13.576289320208511]
Language models (LMs) can reproduce (or amplify) toxic language seen during training.
We analyze the impact of prompts, decoding strategies and training corpora on the output.
We propose a simple yet effective method for language models to "detoxify" themselves without an additional large corpus or external discriminator.
arXiv Detail & Related papers (2022-03-06T23:55:12Z) - Mitigating Biases in Toxic Language Detection through Invariant
Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields a lower false positive rate on both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z) - Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z) - RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language
Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
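A sketch of the benchmark's "expected maximum toxicity" style of evaluation follows; the generation and scoring callables are hypothetical stand-ins (the benchmark itself scores continuations with the Perspective API).

```python
# Sketch: expected maximum toxicity -- sample several continuations per
# prompt and average the per-prompt maxima. Interfaces are hypothetical.
from statistics import mean
from typing import Callable, List

def expected_max_toxicity(prompts: List[str],
                          generate: Callable[[str, int], List[str]],
                          toxicity_score: Callable[[str], float],
                          n_samples: int = 25) -> float:
    per_prompt_max = [
        max(toxicity_score(c) for c in generate(p, n_samples))
        for p in prompts
    ]
    return mean(per_prompt_max)
```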
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.