Self-Detoxifying Language Models via Toxification Reversal
- URL: http://arxiv.org/abs/2310.09573v1
- Date: Sat, 14 Oct 2023 12:51:38 GMT
- Title: Self-Detoxifying Language Models via Toxification Reversal
- Authors: Chak Tou Leong, Yi Cheng, Jiashuo Wang, Jian Wang, Wenjie Li
- Abstract summary: Language model detoxification aims to minimize the risk of generating offensive or harmful content in pretrained language models (PLMs).
We propose a more lightweight approach that enables the PLM itself to achieve "self-detoxification".
Our method is built upon the observation that prepending a negative steering prompt can effectively induce PLMs to generate toxic content.
- Score: 11.238212967733165
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Language model detoxification aims to minimize the risk of generating
offensive or harmful content in pretrained language models (PLMs) for safer
deployment. Existing methods can be roughly categorized as finetuning-based and
decoding-based. However, the former is often resource-intensive, while the
latter relies on additional components and potentially compromises the
generation fluency. In this paper, we propose a more lightweight approach that
enables the PLM itself to achieve "self-detoxification". Our method is built
upon the observation that prepending a negative steering prompt can effectively
induce PLMs to generate toxic content. At the same time, we are inspired by the
recent research in the interpretability field, which formulates the evolving
contextualized representations within the PLM as an information stream
facilitated by the attention layers. Drawing on this idea, we devise a method
to identify the toxification direction from the normal generation process to
the one prompted with the negative prefix, and then steer the generation to the
reversed direction by manipulating the information movement within the
attention layers. Experimental results show that our approach, without any
fine-tuning or extra components, can achieve comparable performance with
state-of-the-art methods.
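
The abstract's two steps (find the toxification direction, then steer against it) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `strength` parameter, the norm-restoring step, and the use of plain arrays in place of real attention-head outputs are all assumptions made for illustration.

```python
import numpy as np

def toxification_direction(normal_heads, toxic_heads):
    """Per-head difference between the negatively prompted run and the normal run.

    Both inputs are hypothetical arrays of shape (num_heads, head_dim).
    """
    return toxic_heads - normal_heads

def reverse_steer(heads, direction, strength=0.5, eps=1e-8):
    """Shift head outputs against `direction`, then rescale to the original norms.

    The rescaling step is an assumption of this sketch; it keeps the
    magnitude of each head output unchanged after the shift.
    """
    steered = heads - strength * direction
    old_norm = np.linalg.norm(heads, axis=-1, keepdims=True)
    new_norm = np.linalg.norm(steered, axis=-1, keepdims=True)
    return steered * old_norm / (new_norm + eps)

# Toy stand-ins for attention-head outputs (4 heads, 8 dimensions each).
rng = np.random.default_rng(0)
normal = rng.normal(size=(4, 8))          # run without any prefix
toxic = normal + rng.normal(size=(4, 8))  # run with a negative prefix
direction = toxification_direction(normal, toxic)
detoxified = reverse_steer(normal, direction, strength=0.5)
```

In a real model the two sets of head outputs would come from two forward passes over the same context, one with and one without the negative steering prompt; here they are random placeholders.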
Related papers
- DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion [16.989349884904943]
We propose DeStein, a novel method that detoxifies language models.
We leverage self-induced steering pairs to identify detoxification vectors.
During inference, detoxification is achieved by blending the detoxification vectors with the original representations.
arXiv Detail & Related papers (2024-04-16T11:07:48Z) - Language Rectified Flow: Advancing Diffusion Language Generation with Probabilistic Flows [53.31856123113228]
This paper proposes Language Rectified Flow (ours)
Our method is based on the reformulation of the standard probabilistic flow models.
Experiments and ablation studies demonstrate that our method can be general, effective, and beneficial for many NLP tasks.
arXiv Detail & Related papers (2024-03-25T17:58:22Z) - Detoxifying Large Language Models via Knowledge Editing [57.0669577257301]
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs)
We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts.
We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance.
arXiv Detail & Related papers (2024-03-21T15:18:30Z) - DPP-Based Adversarial Prompt Searching for Lanugage Models [56.73828162194457]
Auto-regressive Selective Replacement Ascent (ASRA) is a discrete optimization algorithm that selects prompts based on both quality and similarity with a determinantal point process (DPP).
Experimental results on six different pre-trained language models demonstrate the efficacy of ASRA for eliciting toxic content.
arXiv Detail & Related papers (2024-03-01T05:28:06Z) - Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models [26.474136481185724]
We propose fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without additional cost.
FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt.
We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels.
arXiv Detail & Related papers (2024-02-23T09:04:48Z) - Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models [25.212449683397647]
This paper studies the integration of a contrastive learning objective for fine-tuning LLMs for implicit knowledge editing and controlled text generation.
To facilitate training the model in a self-supervised fashion, we leverage an off-the-shelf LLM for training data generation.
arXiv Detail & Related papers (2024-01-16T16:49:39Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - CMD: a framework for Context-aware Model self-Detoxification [25.02108563221933]
Text detoxification aims to minimize the risk of language models producing toxic content.
Existing detoxification methods fail to achieve a decent balance between detoxification effectiveness and generation quality.
We introduce a Context-aware Model self-Detoxification (CMD) framework that pays attention to both the context and the detoxification process.
arXiv Detail & Related papers (2023-08-16T11:50:38Z) - Leashing the Inner Demons: Self-Detoxification for Language Models [13.576289320208511]
Language models (LMs) can reproduce (or amplify) toxic language seen during training.
We analyze the impact of prompts, decoding strategies and training corpora on the output.
We propose a simple yet effective method for language models to "detoxify" themselves without an additional large corpus or external discriminator.
arXiv Detail & Related papers (2022-03-06T23:55:12Z) - A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation [53.8171136907856]
We introduce a set of simple yet effective data augmentation strategies dubbed cutoff.
cutoff relies on sampling consistency and thus adds little computational overhead.
cutoff consistently outperforms adversarial training and achieves state-of-the-art results on the IWSLT2014 German-English dataset.
arXiv Detail & Related papers (2020-09-29T07:08:35Z) - RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language, which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.