Parameter-Efficient Detoxification with Contrastive Decoding
- URL: http://arxiv.org/abs/2401.06947v1
- Date: Sat, 13 Jan 2024 01:46:20 GMT
- Title: Parameter-Efficient Detoxification with Contrastive Decoding
- Authors: Tong Niu, Caiming Xiong, Semih Yavuz, Yingbo Zhou
- Abstract summary: We introduce Detoxification Generator (DETOXIGEN), an inference-time algorithm that steers the generation away from unwanted styles.
During the actual generation, we use the trained detoxifier to produce undesirable tokens for the generator to contrast against at each decoding step.
We find that it significantly outperforms previous approaches in detoxification metrics while not compromising on the generation quality.
- Score: 78.5124331048714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The field of natural language generation has witnessed significant
advancements in recent years, including the development of controllable text
generation techniques. However, controlling the attributes of the generated
text remains a challenge, especially when aiming to avoid undesirable behavior
such as toxicity. In this work, we introduce Detoxification Generator
(DETOXIGEN), an inference-time algorithm that steers the generation away from
unwanted styles. DETOXIGEN is an ensemble of a pre-trained language model
(generator) and a detoxifier. The detoxifier is trained intentionally on the
toxic data representative of the undesirable attribute, encouraging it to
generate text in that style exclusively. During the actual generation, we use
the trained detoxifier to produce undesirable tokens for the generator to
contrast against at each decoding step. This approach directly informs the
generator to avoid generating tokens that the detoxifier considers highly
likely. We evaluate DETOXIGEN on the commonly used REALTOXICITYPROMPTS
benchmark (Gehman et al., 2020) with various language models as generators. We
find that it significantly outperforms previous approaches in detoxification
metrics while not compromising on the generation quality. Moreover, the
detoxifier is obtained by soft prompt-tuning using the same backbone language
model as the generator. Hence, DETOXIGEN requires only a tiny amount of extra
weights from the virtual tokens of the detoxifier to be loaded into GPU memory
while decoding, making it a promising lightweight, practical, and
parameter-efficient detoxification strategy.
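The contrastive step described in the abstract can be sketched as a simple per-step logit combination. This is an illustrative reconstruction, not the paper's exact formula: the contrast weight `alpha` and the plain subtraction are assumptions of this sketch, and greedy argmax stands in for whatever sampling scheme the generator actually uses.

```python
def contrastive_decode_step(gen_logits, detox_logits, alpha=2.0):
    """Pick the next token by contrasting generator and detoxifier logits.

    gen_logits / detox_logits: per-token scores over the vocabulary from
    the generator and the toxicity-primed detoxifier at this step.
    alpha: assumed contrast weight (not from the paper); larger values
    suppress detoxifier-likely tokens more aggressively.
    """
    combined = [g - alpha * d for g, d in zip(gen_logits, detox_logits)]
    # Greedy choice for illustration; sampling from softmax(combined)
    # would work the same way.
    return max(range(len(combined)), key=combined.__getitem__)
```

With toy logits, a token the detoxifier strongly favors loses to a slightly lower-scored alternative: `contrastive_decode_step([2.0, 1.9, 0.0], [3.0, 0.0, 0.0])` returns index 1 rather than 0.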
Related papers
- Mitigating Text Toxicity with Counterfactual Generation [0.3250512744763586]
Toxicity mitigation consists of rephrasing text to remove harmful meaning.
Current methods fail to detoxify text while preserving the initial non-toxic meaning.
This work is the first to bridge the gap between counterfactual generation and text detoxification.
arXiv Detail & Related papers (2024-05-16T09:52:21Z)
- DetoxLLM: A Framework for Detoxification with Explanations [25.174878638472254]
We propose DetoxLLM, the first comprehensive end-to-end detoxification framework.
We first introduce a cross-platform pseudo-parallel corpus applying multi-step data processing and generation strategies.
We show that our detoxification models outperform the SoTA model trained with human-annotated parallel corpus.
arXiv Detail & Related papers (2024-02-25T01:56:47Z)
- Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models [26.474136481185724]
Fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without additional cost.
FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt.
We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels.
arXiv Detail & Related papers (2024-02-23T09:04:48Z)
- Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts [57.38912708076231]
We introduce MaRCo, a detoxification algorithm that combines controllable generation and text rewriting methods.
MaRCo uses likelihoods under a non-toxic LM and a toxic LM to find candidate words to mask and potentially replace.
We evaluate our method on several subtle toxicity and microaggressions datasets, and show that it not only outperforms baselines on automatic metrics, but MaRCo's rewrites are preferred 2.1 times more in human evaluation.
arXiv Detail & Related papers (2022-12-20T18:50:00Z)
- Generating Sequences by Learning to Self-Correct [64.0249217590888]
Self-Correction decouples an imperfect base generator from a separate corrector that learns to iteratively correct imperfect generations.
We show that Self-Correction improves upon the base generator in three diverse generation tasks.
arXiv Detail & Related papers (2022-10-31T18:09:51Z)
- Language Detoxification with Attribute-Discriminative Latent Space [59.167432249229584]
Transformer-based Language Models (LMs) have achieved impressive results on natural language understanding tasks.
They can also generate toxic text such as insults, threats, and profanity, limiting their real-world applications.
We propose an effective yet efficient method for language detoxification using an attribute-discriminative latent space.
arXiv Detail & Related papers (2022-10-19T06:54:42Z)
- ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection [33.715318646717385]
ToxiGen is a large-scale dataset of 274k toxic and benign statements about 13 minority groups.
Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale.
We find that 94.5% of toxic examples are labeled as hate speech by human annotators.
arXiv Detail & Related papers (2022-03-17T17:57:56Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
- GeDi: Generative Discriminator Guided Sequence Generation [53.15651536569169]
We propose GeDi as an efficient method for using smaller LMs as generative discriminators to guide generation from large LMs.
We find that GeDi gives stronger controllability than the state of the art method while also achieving generation speeds more than 30 times faster.
arXiv Detail & Related papers (2020-09-14T17:45:36Z)
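Several of the papers above (GeDi, MaRCo, and DETOXIGEN itself) reweight a base LM's token distribution with auxiliary models. GeDi's generative-discriminator guidance admits a compact sketch: class-conditional LM likelihoods yield a class posterior via Bayes' rule, which then reweights the base LM's token scores. Equal class priors and the guidance weight `omega` are assumptions of this sketch, not details from the abstract.

```python
import math

def gedi_reweight(lm_logprobs, desired_logprobs, undesired_logprobs, omega=1.0):
    """Reweight base-LM token log-probs with a generative discriminator.

    desired_logprobs / undesired_logprobs: log-likelihoods of each
    candidate token under the two class-conditional (control-code) LMs.
    omega: assumed guidance strength (not from the abstract).
    """
    reweighted = []
    for lp, d, u in zip(lm_logprobs, desired_logprobs, undesired_logprobs):
        # Bayes' rule with equal class priors:
        # log P(desired | token) = d - log(exp(d) + exp(u))
        posterior = d - math.log(math.exp(d) + math.exp(u))
        reweighted.append(lp + omega * posterior)
    return reweighted
```

Tokens the desired-class LM favors are boosted relative to tokens the undesired-class LM favors, even when the base LM scores them equally.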
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.