Related papers: Self-Detoxifying Language Models via Toxification Reversal

URL: http://arxiv.org/abs/2310.09573v1
Date: Sat, 14 Oct 2023 12:51:38 GMT
Title: Self-Detoxifying Language Models via Toxification Reversal
Authors: Chak Tou Leong, Yi Cheng, Jiashuo Wang, Jian Wang, Wenjie Li
Abstract summary: Language model detoxification aims to minimize the risk of generating offensive or harmful content in pretrained language models (PLMs). We propose a more lightweight approach that enables the PLM itself to achieve "self-detoxification". Our method is built upon the observation that prepending a negative steering prompt can effectively induce PLMs to generate toxic content.
Score: 11.238212967733165
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Language model detoxification aims to minimize the risk of generating offensive or harmful content in pretrained language models (PLMs) for safer deployment. Existing methods can be roughly categorized as finetuning-based and decoding-based. However, the former is often resource-intensive, while the latter relies on additional components and potentially compromises the generation fluency. In this paper, we propose a more lightweight approach that enables the PLM itself to achieve "self-detoxification". Our method is built upon the observation that prepending a negative steering prompt can effectively induce PLMs to generate toxic content. At the same time, we are inspired by the recent research in the interpretability field, which formulates the evolving contextualized representations within the PLM as an information stream facilitated by the attention layers. Drawing on this idea, we devise a method to identify the toxification direction from the normal generation process to the one prompted with the negative prefix, and then steer the generation to the reversed direction by manipulating the information movement within the attention layers. Experimental results show that our approach, without any fine-tuning or extra components, can achieve comparable performance with state-of-the-art methods.
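As a rough illustration of the idea in the abstract (find the direction from the normal representation to the negatively prompted one, then steer the opposite way), the following is a minimal sketch on toy vectors. It is not the authors' implementation: the function names, the `strength` parameter, and the random vectors standing in for attention-layer outputs are all assumptions made for illustration.

```python
import numpy as np


def toxification_direction(normal_out, negative_out):
    """Unit vector from the normal representation toward the one
    obtained with a negative (toxicity-inducing) prefix."""
    diff = negative_out - normal_out
    norm = np.linalg.norm(diff)
    return diff / norm if norm > 0 else diff


def reverse_steer(value, direction, strength=1.0):
    """Shift `value` against `direction` (hypothetical steering rule)."""
    return value - strength * direction


# Toy stand-ins for one attention head's output at one position (assumed data).
rng = np.random.default_rng(0)
normal_out = rng.normal(size=8)
negative_out = normal_out + 0.5 * rng.normal(size=8)

direction = toxification_direction(normal_out, negative_out)
steered = reverse_steer(normal_out, direction, strength=0.4)

# The steered vector has a smaller component along the toxification direction.
before = float(np.dot(normal_out, direction))
after = float(np.dot(steered, direction))
```

In this sketch, `strength` controls how far the representation is pushed; how the paper scales or normalizes the update is not stated in the abstract, so the plain subtraction here is only a placeholder.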

Related papers

Detoxification of Large Language Models through Output-layer Fusion with a Calibration Model [15.394714537797183]
Existing approaches for large language model (LLM) detoxification rely on training on large-scale non-toxic or human-annotated preference data. We propose a compact, pre-trained calibration model that guides the detoxification process of a target LLM via a lightweight intervention in its generation pipeline.
arXiv Detail & Related papers (2025-06-02T02:36:32Z)
Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing [49.85884082568318]
ToxEdit is a toxicity-aware knowledge editing approach. It dynamically detects toxic activation patterns during forward propagation. It then routes computations through adaptive inter-layer pathways to mitigate toxicity effectively.
arXiv Detail & Related papers (2025-05-28T12:37:06Z)
Risk-Aware Distributional Intervention Policies for Language Models [15.027122089807053]
Language models are prone to occasional undesirable generations, such as harmful or toxic content. This paper presents a new two-stage approach to detect and mitigate undesirable content generations.
arXiv Detail & Related papers (2025-01-27T04:00:38Z)
Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction of large language models (LLMs). SASA tracks the margin of the current output to steer the generation away from the toxic subspace by adjusting the autoregressive sampling strategy. It is evaluated on LLMs of different scale and nature, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, with the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
arXiv Detail & Related papers (2024-10-04T17:45:15Z)
DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion [16.989349884904943]
Current solutions involving finetuning or auxiliary models usually require extensive computational resources. We propose DeStein, a novel method that detoxifies LMs by applying representation engineering in activation spaces with lower resource and time costs.
arXiv Detail & Related papers (2024-04-16T11:07:48Z)
Detoxifying Large Language Models via Knowledge Editing [57.0669577257301]
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts. We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance.
arXiv Detail & Related papers (2024-03-21T15:18:30Z)
DPP-Based Adversarial Prompt Searching for Lanugage Models [56.73828162194457]
Auto-regressive Selective Replacement Ascent (ASRA) is a discrete optimization algorithm that selects prompts based on both quality and similarity with determinantal point process (DPP). Experimental results on six different pre-trained language models demonstrate the efficacy of ASRA for eliciting toxic content.
arXiv Detail & Related papers (2024-03-01T05:28:06Z)
Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models [26.474136481185724]
Fine-grained detoxification via instance-level prefixes (FGDILP) mitigates toxic text without additional cost. FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt. We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels.
arXiv Detail & Related papers (2024-02-23T09:04:48Z)
Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models [25.212449683397647]
This paper studies the integration of a contrastive learning objective for fine-tuning LLMs for implicit knowledge editing and controlled text generation. To facilitate training the model in a self-supervised fashion, we leverage an off-the-shelf LLM for training data generation.
arXiv Detail & Related papers (2024-01-16T16:49:39Z)
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks. This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs. We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z)
Leashing the Inner Demons: Self-Detoxification for Language Models [13.576289320208511]
Language models (LMs) can reproduce (or amplify) toxic language seen during training. We analyze the impact of prompts, decoding strategies and training corpora on the output. We propose a simple yet effective method for language models to "detoxify" themselves without an additional large corpus or external discriminator.
arXiv Detail & Related papers (2022-03-06T23:55:12Z)
A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation [53.8171136907856]
We introduce a set of simple yet effective data augmentation strategies dubbed cutoff. cutoff relies on sampling consistency and thus adds little computational overhead. cutoff consistently outperforms adversarial training and achieves state-of-the-art results on the IWSLT2014 German-English dataset.
arXiv Detail & Related papers (2020-09-29T07:08:35Z)
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
