Language Detoxification with Attribute-Discriminative Latent Space
- URL: http://arxiv.org/abs/2210.10329v2
- Date: Wed, 5 Jul 2023 04:21:08 GMT
- Title: Language Detoxification with Attribute-Discriminative Latent Space
- Authors: Jin Myung Kwak, Minseon Kim and Sung Ju Hwang
- Abstract summary: Transformer-based Language Models (LMs) have achieved impressive results on natural language understanding tasks.
However, they can also generate toxic text such as insults, threats, and profanity, limiting their real-world applications.
We propose an effective yet efficient method for language detoxification using an attribute-discriminative latent space.
- Score: 59.167432249229584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based Language Models (LMs) have achieved impressive results on
natural language understanding tasks, but they can also generate toxic text
such as insults, threats, and profanity, limiting their real-world
applications. To overcome this issue, a few text generation approaches aim to
detoxify toxic texts using additional LMs or perturbations. However, previous
methods require excessive memory, computation, and time, which are serious
bottlenecks for real-world application. To address these limitations, we
propose an effective yet efficient method for language detoxification using an
attribute-discriminative latent space. Specifically, we project the latent
space of an original Transformer LM onto a discriminative latent space that
well-separates texts by their attributes using a projection block and an
attribute discriminator. This allows the LM to control the text generation to
be non-toxic with minimal memory and computation overhead. We validate our
model, the Attribute-Discriminative Language Model (ADLM), on detoxified
language and dialogue generation tasks, on which it significantly outperforms
baselines in both performance and efficiency.
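The core mechanism described above, projecting the LM's hidden states through a projection block and scoring them with an attribute discriminator that biases decoding toward the non-toxic attribute, can be sketched in toy form. Everything below (the dimensions, the random linear maps, the sigmoid penalty rule) is a hypothetical illustration, not the paper's actual architecture:

```python
import math
import random

random.seed(0)

DIM, PROJ_DIM = 8, 4

# Hypothetical projection block: a fixed random linear map from the LM's
# hidden space into a lower-dimensional attribute-discriminative space.
W_proj = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(PROJ_DIM)]

# Hypothetical attribute discriminator: a linear scorer in the projected
# space whose sign separates "toxic" from "non-toxic" representations.
w_disc = [random.gauss(0, 1) for _ in range(PROJ_DIM)]

def project(h):
    """Map an LM hidden state into the attribute-discriminative space."""
    return [sum(w * x for w, x in zip(row, h)) for row in W_proj]

def toxicity_score(h):
    """Discriminator logit in the projected space (> 0 ~ toxic)."""
    z = project(h)
    return sum(w * x for w, x in zip(w_disc, z))

def steer_logits(logits, hidden_states, alpha=2.0):
    """Down-weight candidate tokens whose hidden states the
    discriminator scores as toxic (sigmoid-weighted penalty)."""
    steered = []
    for logit, h in zip(logits, hidden_states):
        p_toxic = 1.0 / (1.0 + math.exp(-toxicity_score(h)))
        steered.append(logit - alpha * p_toxic)
    return steered

# Toy usage: three candidate tokens with mock hidden states.
hiddens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
logits = [1.0, 0.5, 0.2]
print(steer_logits(logits, hiddens))
```

Because the penalty is applied at the logit level, the base LM's parameters stay frozen, which is where the claimed memory and compute savings over auxiliary-LM approaches would come from.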
Related papers
- Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction of large language models (LLMs)
SASA tracks the margin of the current output to steer the generation away from the toxic subspace, by adjusting the autoregressive sampling strategy.
SASA is evaluated on LLMs of different scales and natures, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, with the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
arXiv Detail & Related papers (2024-10-04T17:45:15Z) - Linearly Controlled Language Generation with Performative Guarantees [9.487387238674721]
We use a common model of concept semantics in which concepts are linearly represented in an LM's latent space.
We propose a lightweight, gradient-free intervention that dynamically steers trajectories away from regions corresponding to undesired meanings.
arXiv Detail & Related papers (2024-05-24T11:30:44Z) - DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion [16.989349884904943]
Current solutions involving finetuning or auxiliary models usually require extensive computational resources.
We propose DeStein, a novel method that detoxifies LMs by applying representation engineering in activation spaces with lower resource and time costs.
arXiv Detail & Related papers (2024-04-16T11:07:48Z) - Fine-Grained Detoxification via Instance-Level Prefixes for Large
Language Models [26.474136481185724]
We propose fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without additional cost.
FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt.
We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels.
arXiv Detail & Related papers (2024-02-23T09:04:48Z) - Parameter-Efficient Detoxification with Contrastive Decoding [78.5124331048714]
We introduce Detoxification Generator (DETOXIGEN), an inference-time algorithm that steers the generation away from unwanted styles.
During the actual generation, we use the trained detoxifier to produce undesirable tokens for the generator to contrast against at each decoding step.
We find that it significantly outperforms previous approaches in detoxification metrics while not compromising on the generation quality.
arXiv Detail & Related papers (2024-01-13T01:46:20Z) - Unlocking Anticipatory Text Generation: A Constrained Approach for Large Language Models Decoding [75.06872859716049]
Large Language Models (LLMs) have demonstrated a powerful ability for text generation.
However, undesired behaviors such as toxicity or hallucinations can manifest.
We propose formalizing text generation as a future-constrained generation problem.
arXiv Detail & Related papers (2023-12-11T06:35:33Z) - Successor Features for Efficient Multisubject Controlled Text Generation [48.37713738712319]
We introduce SF-GEN, which is grounded in two primary concepts: successor features (SFs) and language model rectification.
SF-GEN seamlessly integrates the two to enable dynamic steering of text generation with no need to alter the LLM's parameters.
To the best of our knowledge, our research represents the first application of successor features in text generation.
arXiv Detail & Related papers (2023-11-03T00:17:08Z) - CFL: Causally Fair Language Models Through Token-level Attribute
Controlled Generation [5.210143170392524]
We propose a method to control the attributes of Language Models (LMs) for the text generation task using Causal Average Treatment Effect (ATE) scores and counterfactual augmentation.
We explore this method, in the context of LM detoxification, and propose the Causally Fair Language (CFL) architecture for detoxifying pre-trained LMs in a plug-and-play manner.
arXiv Detail & Related papers (2023-06-01T06:13:51Z) - RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language
Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
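Several of the papers above (SASA, DeStein, DETOXIGEN) share a common decoding-time recipe: score each candidate token against a toxicity direction and reweight the next-token distribution before sampling. A minimal sketch of that shared recipe follows; the margin values and the exponential penalty are illustrative assumptions, not any single paper's method:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steered_sampling_weights(logits, toxicity_margins, beta=1.5):
    """Reweight next-token probabilities: tokens whose classifier margin
    lies on the toxic side (margin > 0) are exponentially down-weighted,
    then the distribution is renormalized."""
    probs = softmax(logits)
    weights = [p * math.exp(-beta * max(0.0, m))
               for p, m in zip(probs, toxicity_margins)]
    total = sum(weights)
    return [w / total for w in weights]

# Toy example: tokens 0 and 1 have equal logits, but token 1 sits on
# the toxic side of a hypothetical classifier margin.
logits = [2.0, 2.0, 0.5]
margins = [-1.0, 0.8, -0.3]
print(steered_sampling_weights(logits, margins))
```

Tuning the assumed `beta` trades off detoxification strength against fluency: a larger value suppresses toxic-side tokens more aggressively but distorts the base distribution more.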
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.