DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion
- URL: http://arxiv.org/abs/2404.10464v3
- Date: Sat, 10 Aug 2024 14:17:59 GMT
- Title: DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion
- Authors: Yu Li, Han Jiang, Chuanyang Gong, Zhihua Wei
- Abstract summary: Current solutions involving finetuning or auxiliary models usually require extensive computational resources.
We propose DeStein, a novel method that detoxifies LMs by applying representation engineering in activation spaces with lower resource and time costs.
- Score: 16.989349884904943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the remarkable achievements of language models (LMs) across a broad spectrum of tasks, their propensity for generating toxic outputs remains a prevalent concern. Current solutions involving finetuning or auxiliary models usually require extensive computational resources, hindering their practicality in large language models (LLMs). In this paper, we propose DeStein, a novel method that detoxifies LMs by applying representation engineering in activation spaces with lower resource and time costs. Specifically, we derive detoxification vectors from self-induced, universal steering pairs through arithmetic operations in activation spaces. During inference, detoxification is achieved by fusing the detoxification vectors with the original representations in a head-wise manner. Empirical results demonstrate that our method significantly outperforms previous state-of-the-art approaches on various metrics, while also maintaining satisfactory generation quality and diversity. We further validate the practicality and scalability of DeStein with a series of white-box LLMs. The method is open-sourced at https://github.com/LizLizLi/DeStein. Warning: Some example model outputs may contain highly offensive or disturbing text.
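The two steps named in the abstract (deriving detoxification vectors by arithmetic on paired activations, then fusing them head by head at inference) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names `detox_vectors` and `fuse_headwise`, the mean-difference formula, the per-head weights, and the `alpha` scale are all assumptions made for illustration, and the activations are toy numbers rather than values read from a model.

```python
def detox_vectors(toxic_acts, nontoxic_acts):
    """Per-head direction: mean of (non-toxic - toxic) over all pairs.

    Inputs are nested lists indexed as [pair][head][dim].
    The mean-difference formula is an assumption, not taken from the paper.
    """
    n_pairs = len(toxic_acts)
    n_heads = len(toxic_acts[0])
    head_dim = len(toxic_acts[0][0])
    return [
        [
            sum(n[h][d] - t[h][d] for t, n in zip(toxic_acts, nontoxic_acts)) / n_pairs
            for d in range(head_dim)
        ]
        for h in range(n_heads)
    ]


def fuse_headwise(hidden, vectors, head_weights, alpha=1.0):
    """Add each head's vector to that head's activation.

    head_weights (one per head) and alpha are hypothetical strength knobs.
    """
    return [
        [x + alpha * w * v for x, v in zip(head, vec)]
        for head, vec, w in zip(hidden, vectors, head_weights)
    ]


# Toy activations: 2 steering pairs, 2 heads, 2 dimensions per head.
toxic = [[[1.0, 2.0], [0.0, 1.0]], [[3.0, 0.0], [2.0, 1.0]]]
nontoxic = [[[2.0, 2.0], [0.0, 0.0]], [[4.0, 2.0], [2.0, 3.0]]]

vectors = detox_vectors(toxic, nontoxic)
hidden = [[0.0, 0.0], [1.0, 1.0]]
steered = fuse_headwise(hidden, vectors, head_weights=[0.5, 1.0], alpha=2.0)
```

In a real model the activations would typically be collected and modified with framework hooks on the attention layers; the nested lists here only keep the arithmetic visible.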
Related papers
- Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction of large language models (LLMs).
SASA tracks the margin of the current output to steer the generation away from the toxic subspace, by adjusting the autoregressive sampling strategy.
SASA is evaluated on LLMs of different scale and nature, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, using the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
arXiv Detail & Related papers (2024-10-04T17:45:15Z)
- Language Rectified Flow: Advancing Diffusion Language Generation with Probabilistic Flows [53.31856123113228]
This paper proposes Language Rectified Flow.
Our method is based on the reformulation of the standard probabilistic flow models.
Experiments and ablation studies demonstrate that our method can be general, effective, and beneficial for many NLP tasks.
arXiv Detail & Related papers (2024-03-25T17:58:22Z)
- AXOLOTL: Fairness through Assisted Self-Debiasing of Large Language Model Outputs [20.772266479533776]
AXOLOTL is a novel post-processing framework that operates agnostically across tasks and models.
It identifies biases, proposes resolutions, and guides the model to self-debias its outputs.
This approach minimizes computational costs and preserves model performance.
arXiv Detail & Related papers (2024-03-01T00:02:37Z)
- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- Improving Activation Steering in Language Models with Mean-Centring [10.101141087916133]
We find that taking the average of activations associated with a target dataset, and subtracting the mean of all training activations, results in effective steering vectors.
We also apply mean-centring to extract function vectors, more effectively triggering the execution of a range of natural language tasks by a significant margin.
arXiv Detail & Related papers (2023-12-06T18:27:07Z)
- Self-Detoxifying Language Models via Toxification Reversal [11.238212967733165]
Language model detoxification aims to minimize the risk of generating offensive or harmful content in pretrained language models (PLMs).
We propose a more lightweight approach that enables the PLM itself to achieve "self-detoxification".
Our method is built upon the observation that prepending a negative steering prompt can effectively induce PLMs to generate toxic content.
arXiv Detail & Related papers (2023-10-14T12:51:38Z)
- Augmented Language Models: a Survey [55.965967655575454]
This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools.
We refer to them as Augmented Language Models (ALMs).
The missing token objective allows ALMs to learn to reason, use tools, and even act, while still performing standard natural language tasks.
arXiv Detail & Related papers (2023-02-15T18:25:52Z)
- Language Detoxification with Attribute-Discriminative Latent Space [59.167432249229584]
Transformer-based Language Models (LMs) have achieved impressive results on natural language understanding tasks.
They can also generate toxic text such as insults, threats, and profanity, limiting their real-world applications.
We propose an effective yet efficient method for language detoxification using an attribute-discriminative latent space.
arXiv Detail & Related papers (2022-10-19T06:54:42Z)
- GeDi: Generative Discriminator Guided Sequence Generation [53.15651536569169]
We propose GeDi as an efficient method for using smaller LMs as generative discriminators to guide generation from large LMs.
We find that GeDi gives stronger controllability than the state-of-the-art method while also achieving generation speeds more than 30 times faster.
arXiv Detail & Related papers (2020-09-14T17:45:36Z)
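
The mean-centring entry in the list above describes its steering vector as the average activation over a target dataset minus the mean of all training activations. A minimal sketch of that arithmetic follows, assuming each activation is a flat list of floats; the helper names `mean_vector` and `mean_centred_steering` and the toy numbers are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of equal-length activation rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_centred_steering(target_acts, training_acts):
    """Target-dataset mean minus the mean of all training activations."""
    mu_target = mean_vector(target_acts)
    mu_train = mean_vector(training_acts)
    return [t - m for t, m in zip(mu_target, mu_train)]


# Toy activations with two dimensions each.
target = [[2.0, 0.0], [4.0, 2.0]]
training = [[1.0, 1.0], [3.0, 1.0], [2.0, 4.0], [2.0, 2.0]]

steer = mean_centred_steering(target, training)
```

Subtracting the training mean removes the component shared by all activations, so the remaining vector reflects what is specific to the target dataset.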