Self-Detoxifying Language Models via Toxification Reversal
- URL: http://arxiv.org/abs/2310.09573v1
- Date: Sat, 14 Oct 2023 12:51:38 GMT
- Title: Self-Detoxifying Language Models via Toxification Reversal
- Authors: Chak Tou Leong, Yi Cheng, Jiashuo Wang, Jian Wang, Wenjie Li
- Abstract summary: Language model detoxification aims to minimize the risk of generating offensive or harmful content in pretrained language models (PLMs).
We propose a more lightweight approach that enables the PLM itself to achieve "self-detoxification".
Our method is built upon the observation that prepending a negative steering prompt can effectively induce PLMs to generate toxic content.
- Score: 11.238212967733165
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Language model detoxification aims to minimize the risk of generating
offensive or harmful content in pretrained language models (PLMs) for safer
deployment. Existing methods can be roughly categorized as finetuning-based and
decoding-based. However, the former is often resource-intensive, while the
latter relies on additional components and potentially compromises the
generation fluency. In this paper, we propose a more lightweight approach that
enables the PLM itself to achieve "self-detoxification". Our method is built
upon the observation that prepending a negative steering prompt can effectively
induce PLMs to generate toxic content. At the same time, we are inspired by the
recent research in the interpretability field, which formulates the evolving
contextualized representations within the PLM as an information stream
facilitated by the attention layers. Drawing on this idea, we devise a method
to identify the toxification direction from the normal generation process to
the one prompted with the negative prefix, and then steer the generation to the
reversed direction by manipulating the information movement within the
attention layers. Experimental results show that our approach, without any
fine-tuning or extra components, can achieve comparable performance with
state-of-the-art methods.
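As a rough illustration of this idea, the sketch below estimates a per-layer "toxification direction" by contrasting attention-block outputs with and without a negative steering prefix, then subtracts a scaled version of that direction during generation. The GPT-2 backbone, the hand-written prefix, hooking whole attention blocks rather than per-head information flow, and the scaling factor ALPHA are illustrative assumptions, not the paper's exact procedure.
```python
# Minimal sketch of "toxification reversal" (illustrative assumptions, not the
# paper's exact method): estimate the direction in which a negative steering
# prefix pushes the attention-layer outputs, then subtract a scaled version of
# that direction while generating.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

ALPHA = 4.0                        # strength of the reversal (illustrative)
NEG_PREFIX = "The following text is extremely rude and disrespectful:\n"

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def last_token_attn_outputs(text):
    """Collect each attention block's output at the final prompt position."""
    outs = []
    hooks = [
        block.attn.register_forward_hook(
            lambda m, i, o, store=outs: store.append(o[0][:, -1, :].detach())
        )
        for block in model.transformer.h
    ]
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    for h in hooks:
        h.remove()
    return outs                    # one [1, hidden] tensor per layer

prompt = "I can't believe you would"
# Toxification direction per layer = (negatively prompted run) - (normal run).
directions = [
    toxic - clean
    for toxic, clean in zip(
        last_token_attn_outputs(NEG_PREFIX + prompt),
        last_token_attn_outputs(prompt),
    )
]

# Steer every attention output in the reversed direction during generation
# (applied to all positions here, a simplification of the paper's manipulation).
steer_hooks = [
    block.attn.register_forward_hook(
        lambda m, i, o, d=d: (o[0] - ALPHA * d,) + o[1:]
    )
    for block, d in zip(model.transformer.h, directions)
]
with torch.no_grad():
    out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=30)
print(tok.decode(out[0], skip_special_tokens=True))
for h in steer_hooks:
    h.remove()
```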
Related papers
- Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction of large language models (LLMs).
SASA tracks the margin of the current output to steer the generation away from the toxic subspace, by adjusting the autoregressive sampling strategy.
SASA is evaluated on LLMs of different scales and natures, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, with the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
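A toy sketch of this kind of margin-adjusted sampling, with random stand-ins for the LM logits and the toxicity classifier (the classifier, beta, and temperature below are not SASA's actual components):
```python
# Toy sketch in the spirit of SASA: candidate tokens are re-weighted by the
# signed margin of a linear toxicity classifier so that sampling drifts away
# from the toxic subspace. All tensors below are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 100, 16

# Stand-ins: next-token logits from an LM, and the hidden state the context
# would have after (hypothetically) appending each candidate token.
logits = rng.normal(size=vocab_size)
cand_states = rng.normal(size=(vocab_size, hidden))

# Stand-in linear classifier separating non-toxic (+) from toxic (-) states.
w, b = rng.normal(size=hidden), 0.0
margins = cand_states @ w + b          # > 0 means the "safe" side of the boundary

def margin_adjusted_sample(logits, margins, beta=2.0, temperature=1.0):
    """Sample a token after tilting the LM distribution by the safety margin."""
    scores = logits / temperature + beta * margins
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

print("sampled token id:", margin_adjusted_sample(logits, margins))
```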
arXiv Detail & Related papers (2024-10-04T17:45:15Z)
- DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion [16.989349884904943]
Current solutions involving finetuning or auxiliary models usually require extensive computational resources.
We propose DeStein, a novel method that detoxifies LMs by applying representation engineering in activation spaces with lower resource and time costs.
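A toy sketch of steering-pair representation engineering with head-wise fusion; the shapes, the averaging scheme, and the consistency-based head weights are illustrative assumptions rather than DeStein's exact construction:
```python
# Toy sketch: per-head steering vectors estimated from paired (toxic, non-toxic)
# activations, fused head-wise into new activations at inference time.
import numpy as np

rng = np.random.default_rng(1)
n_pairs, n_heads, head_dim = 32, 12, 64

# Stand-ins for attention-head activations collected on paired prompts.
toxic_acts = rng.normal(size=(n_pairs, n_heads, head_dim))
detox_acts = rng.normal(loc=0.3, size=(n_pairs, n_heads, head_dim))

# Steering vector per head: average "non-toxic minus toxic" direction.
steer = (detox_acts - toxic_acts).mean(axis=0)           # [n_heads, head_dim]

# Head-wise fusion: heads with a more consistent direction get more weight
# (inverse variance of the pairwise differences, an illustrative choice).
consistency = 1.0 / (1e-6 + (detox_acts - toxic_acts).var(axis=(0, 2)))
weights = consistency / consistency.sum()                # [n_heads]

def apply_steering(head_acts, steer, weights, alpha=1.0):
    """Add the weighted steering vectors to per-head activations."""
    return head_acts + alpha * weights[:, None] * steer

new_acts = apply_steering(rng.normal(size=(n_heads, head_dim)), steer, weights)
print(new_acts.shape)
```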
arXiv Detail & Related papers (2024-04-16T11:07:48Z)
- Detoxifying Large Language Models via Knowledge Editing [57.0669577257301]
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs).
We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts.
We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance.
arXiv Detail & Related papers (2024-03-21T15:18:30Z)
- DPP-Based Adversarial Prompt Searching for Language Models [56.73828162194457]
Auto-regressive Selective Replacement Ascent (ASRA) is a discrete optimization algorithm that selects prompts based on both quality and similarity using a determinantal point process (DPP).
Experimental results on six different pre-trained language models demonstrate the efficacy of ASRA for eliciting toxic content.
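A toy sketch of quality-plus-diversity candidate selection with a DPP, loosely in the spirit of ASRA's prompt selection; the features, quality scores, kernel form, and greedy budget are illustrative stand-ins:
```python
# Toy sketch: greedy MAP selection under a DPP whose kernel couples a per-prompt
# quality score with pairwise similarity. All inputs are random stand-ins.
import numpy as np

rng = np.random.default_rng(2)
n_candidates, dim, budget = 50, 8, 5

feats = rng.normal(size=(n_candidates, dim))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
quality = rng.uniform(0.1, 1.0, size=n_candidates)   # e.g. a toxicity-eliciting score

# DPP kernel: L = diag(q) * S * diag(q), with S a cosine-similarity matrix.
S = feats @ feats.T
L = quality[:, None] * S * quality[None, :]

def greedy_dpp(L, k):
    """Greedily pick k items maximizing the log-determinant of L restricted to them."""
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(L)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gain = logdet if sign > 0 else -np.inf
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

print("selected prompt indices:", greedy_dpp(L, budget))
```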
arXiv Detail & Related papers (2024-03-01T05:28:06Z)
- Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models [26.474136481185724]
Fine-grained detoxification via instance-level prefixes (FGDILP) is proposed to mitigate toxic text without additional cost.
FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt.
We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels.
arXiv Detail & Related papers (2024-02-23T09:04:48Z)
- Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models [25.212449683397647]
This paper studies the integration of a contrastive learning objective for fine-tuning LLMs for implicit knowledge editing and controlled text generation.
To facilitate training the model in a self-supervised fashion, we leverage an off-the-shelf LLM for training data generation.
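A toy sketch of a contrastive perplexity-style objective on a (non-toxic, toxic) continuation pair; the margin form, the added language-modeling term, and the toy tensors are illustrative assumptions, not the paper's exact loss:
```python
# Toy sketch: push per-token NLL down on a non-toxic continuation and up on a
# paired toxic one, contrastively. Tensors below are random stand-ins for the
# model's logits on the two continuations.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, seq_len = 50, 12

def sequence_nll(logits, targets):
    """Mean token-level negative log-likelihood of a continuation."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

logits_pos = torch.randn(1, seq_len, vocab, requires_grad=True)   # non-toxic
logits_neg = torch.randn(1, seq_len, vocab, requires_grad=True)   # toxic
targets_pos = torch.randint(0, vocab, (1, seq_len))
targets_neg = torch.randint(0, vocab, (1, seq_len))

nll_pos = sequence_nll(logits_pos, targets_pos)
nll_neg = sequence_nll(logits_neg, targets_neg)

# Contrastive term: the toxic continuation should be at least `margin` nats
# worse (higher NLL) than the non-toxic one; the plain LM term on the
# non-toxic side helps preserve fluency.
margin = 1.0
loss = nll_pos + F.relu(margin + nll_pos - nll_neg)
loss.backward()
print(float(loss))
```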
arXiv Detail & Related papers (2024-01-16T16:49:39Z)
- Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
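A minimal sketch of token-level perplexity screening under a small reference LM; the GPT-2 scorer and the simple median-deviation rule are illustrative assumptions (the paper's detector also exploits contextual information):
```python
# Minimal sketch: score each token's negative log-likelihood under a small
# reference LM and flag tokens whose NLL is far above the prompt's median.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_nll(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position t predict token t+1, so the first token gets no score.
    logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    nll = -logprobs.gather(-1, ids[:, 1:, None]).squeeze(-1)[0]
    return tok.convert_ids_to_tokens(ids[0, 1:].tolist()), nll

def flag_suspicious(text, z=3.0):
    tokens, nll = token_nll(text)
    median = nll.median()
    spread = (nll - median).abs().median() + 1e-6   # robust scale estimate
    return [t for t, s in zip(tokens, nll) if (s - median) / spread > z]

# Toy example with a garbled suffix standing in for an adversarial one.
print(flag_suspicious("Please summarize this article. zx describing.+similarlyNow"))
```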
arXiv Detail & Related papers (2023-11-20T03:17:21Z)
- Leashing the Inner Demons: Self-Detoxification for Language Models [13.576289320208511]
Language models (LMs) can reproduce (or amplify) toxic language seen during training.
We analyze the impact of prompts, decoding strategies and training corpora on the output.
We propose a simple yet effective method for language models to "detoxify" themselves without an additional large corpus or external discriminator.
arXiv Detail & Related papers (2022-03-06T23:55:12Z)
- A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation [53.8171136907856]
We introduce a set of simple yet effective data augmentation strategies dubbed cutoff.
cutoff relies on sampling consistency and thus adds little computational overhead.
cutoff consistently outperforms adversarial training and achieves state-of-the-art results on the IWSLT2014 German-English dataset.
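A toy sketch of span cutoff on input embeddings with a consistency term between the original and augmented views; the tiny classifier, span ratio, and symmetric-KL term are illustrative choices, not the paper's exact setup:
```python
# Toy sketch of the "cutoff" idea: zero a contiguous span of input embeddings
# to create an augmented view, then add a consistency term between the
# predictions on the original and the augmented view.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, emb_dim, n_classes = 20, 32, 3

def span_cutoff(emb, ratio=0.2):
    """Zero a random contiguous span of token embeddings (one augmented view)."""
    out = emb.clone()
    span = max(1, int(ratio * emb.size(0)))
    start = torch.randint(0, emb.size(0) - span + 1, (1,)).item()
    out[start:start + span] = 0.0
    return out

classifier = torch.nn.Linear(emb_dim, n_classes)    # stand-in for the task model
emb = torch.randn(seq_len, emb_dim)
label = torch.tensor(1)

def predict(e):
    return classifier(e.mean(dim=0))                # mean-pool, then classify

logits_orig = predict(emb)
logits_aug = predict(span_cutoff(emb))

# Symmetric KL between the two views' predictive distributions.
p = F.log_softmax(logits_orig, dim=-1)
q = F.log_softmax(logits_aug, dim=-1)
consistency = 0.5 * (F.kl_div(p, q, log_target=True, reduction="sum")
                     + F.kl_div(q, p, log_target=True, reduction="sum"))
loss = F.cross_entropy(logits_orig.unsqueeze(0), label.unsqueeze(0)) + consistency
loss.backward()
print(float(loss))
```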
arXiv Detail & Related papers (2020-09-29T07:08:35Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)