CFL: Causally Fair Language Models Through Token-level Attribute
Controlled Generation
- URL: http://arxiv.org/abs/2306.00374v1
- Date: Thu, 1 Jun 2023 06:13:51 GMT
- Title: CFL: Causally Fair Language Models Through Token-level Attribute
Controlled Generation
- Authors: Rahul Madhavan, Rishabh Garg, Kahini Wadhawan and Sameep Mehta
- Abstract summary: We propose a method to control the attributes of Language Models (LMs) for the text generation task using Causal Average Treatment Effect (ATE) scores and counterfactual augmentation.
We explore this method in the context of LM detoxification and propose the Causally Fair Language (CFL) architecture for detoxifying pre-trained LMs in a plug-and-play manner.
- Score: 5.210143170392524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a method to control the attributes of Language Models (LMs) for
the text generation task using Causal Average Treatment Effect (ATE) scores and
counterfactual augmentation. We explore this method in the context of LM
detoxification and propose the Causally Fair Language (CFL) architecture for
detoxifying pre-trained LMs in a plug-and-play manner. Our architecture is
based on a Structural Causal Model (SCM) that is mathematically transparent and
computationally efficient compared with many existing detoxification
techniques. We also propose several new metrics that aim to better understand
the behaviour of LMs in the context of toxic text generation. Further, we
achieve state-of-the-art performance in mitigating toxic degeneration, as
measured on the RealToxicityPrompts (RTP) benchmark. Our experiments show that
CFL achieves this detoxification without much impact on model perplexity. We
also show that CFL mitigates the unintended bias problem through experiments on
the BOLD dataset.
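The ATE-based attribute control described above can be illustrated with a minimal sketch: a token's causal effect on toxicity is estimated by comparing contexts containing the token against counterfactuals where it is replaced by a neutral token. The toxicity scorer, word list, and templates below are toy stand-ins for illustration only, not the paper's actual components.

```python
# Toy illustration of token-level ATE scoring with counterfactual
# augmentation. The toxicity scorer is a hypothetical word-list
# stand-in, not the classifier used in CFL.

TOXIC_WORDS = frozenset({"stupid", "idiot"})

def toxicity(sentence):
    """Toy toxicity score: fraction of tokens found in TOXIC_WORDS."""
    tokens = sentence.lower().split()
    return sum(t in TOXIC_WORDS for t in tokens) / max(len(tokens), 1)

def ate_score(token, contexts, neutral_token="person"):
    """Estimate the treatment effect of `token` on toxicity by
    averaging, over contexts, the difference between the treated
    sentence and a counterfactual with a neutral replacement."""
    effects = []
    for ctx in contexts:  # each ctx is a template with one {} slot
        treated = ctx.format(token)
        counterfactual = ctx.format(neutral_token)
        effects.append(toxicity(treated) - toxicity(counterfactual))
    return sum(effects) / len(effects)

contexts = ["you are a {}", "what a {} move"]
print(ate_score("idiot", contexts))   # 0.25: token raises toxicity
print(ate_score("friend", contexts))  # 0.0: no causal effect
```

Tokens with high ATE scores would then be down-weighted or replaced during generation, which is the intuition behind plug-and-play detoxification.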
Related papers
- R-SFLLM: Jamming Resilient Framework for Split Federated Learning with Large Language Models [83.77114091471822]
Split federated learning (SFL) is a compute-efficient paradigm in distributed machine learning (ML).
A challenge in SFL, particularly when deployed over wireless channels, is the susceptibility of transmitted model parameters to adversarial jamming.
This is particularly pronounced for word embedding parameters in large language models (LLMs), which are crucial for language understanding.
A physical layer framework is developed for resilient SFL with LLMs (R-SFLLM) over wireless networks.
arXiv Detail & Related papers (2024-07-16T12:21:29Z) - Revisiting Catastrophic Forgetting in Large Language Model Tuning [79.70722658190097]
Catastrophic Forgetting (CF) means models forgetting previously acquired knowledge when learning new data.
This paper takes the first step to reveal the direct link between the flatness of the model loss landscape and the extent of CF in the field of large language models.
Experiments on three widely-used fine-tuning datasets, spanning different model scales, demonstrate the effectiveness of our method in alleviating CF.
arXiv Detail & Related papers (2024-06-07T11:09:13Z) - DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion [16.989349884904943]
We propose DeStein, a novel method that detoxifies language models.
We leverage self-induced steering pairs to identify detoxification vectors.
During inference, detoxification is achieved by blending the detoxification vectors with the original representations.
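The inference-time blending described in the DeStein summary can be sketched as simple additive activation steering: a precomputed detoxification vector is mixed into the model's hidden representation. The vector, dimensionality, and blend weight below are illustrative placeholders, not DeStein's actual values.

```python
import numpy as np

def steer(hidden, detox_vector, alpha=0.5):
    """Blend a steering (detoxification) vector into a hidden state.
    `alpha` controls the strength of the intervention."""
    return hidden + alpha * detox_vector

rng = np.random.default_rng(0)
h = rng.normal(size=8)  # stand-in for a transformer hidden state
v = rng.normal(size=8)  # stand-in for a precomputed detox vector
h_steered = steer(h, v)

# The intervention is a pure additive shift along the detox direction.
print(np.allclose(h_steered - h, 0.5 * v))  # True
```

In practice such vectors are derived from paired examples (e.g. toxic vs. non-toxic continuations) and applied per attention head rather than to a single flat vector.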
arXiv Detail & Related papers (2024-04-16T11:07:48Z) - Language Rectified Flow: Advancing Diffusion Language Generation with Probabilistic Flows [53.31856123113228]
This paper proposes Language Rectified Flow.
Our method is based on the reformulation of the standard probabilistic flow models.
Experiments and ablation studies demonstrate that our method can be general, effective, and beneficial for many NLP tasks.
arXiv Detail & Related papers (2024-03-25T17:58:22Z) - Detoxifying Large Language Models via Knowledge Editing [57.0669577257301]
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs).
We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts.
We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance.
arXiv Detail & Related papers (2024-03-21T15:18:30Z) - Self-Detoxifying Language Models via Toxification Reversal [11.238212967733165]
Language model detoxification aims to minimize the risk of generating offensive or harmful content in pretrained language models (PLMs).
We propose a more lightweight approach that enables the PLM itself to achieve "self-detoxification".
Our method is built upon the observation that prepending a negative steering prompt can effectively induce PLMs to generate toxic content.
arXiv Detail & Related papers (2023-10-14T12:51:38Z) - Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks.
We conducted experiments using the Llama2-7b-chat model on nine different languages from the MuST-C dataset.
The results show that the LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z) - CMD: a framework for Context-aware Model self-Detoxification [25.02108563221933]
Text detoxification aims to minimize the risk of language models producing toxic content.
Existing detoxification methods fail to achieve a decent balance between detoxification effectiveness and generation quality.
We introduce a Context-aware Model self-Detoxification (CMD) framework that pays attention to both the context and the detoxification process.
arXiv Detail & Related papers (2023-08-16T11:50:38Z) - Language Detoxification with Attribute-Discriminative Latent Space [59.167432249229584]
Transformer-based Language Models (LMs) have achieved impressive results on natural language understanding tasks.
They can also generate toxic text such as insults, threats, and profanity, limiting their real-world applications.
We propose an effective yet efficient method for language detoxification using an attribute-discriminative latent space.
arXiv Detail & Related papers (2022-10-19T06:54:42Z) - Leashing the Inner Demons: Self-Detoxification for Language Models [13.576289320208511]
Language models (LMs) can reproduce (or amplify) toxic language seen during training.
We analyze the impact of prompts, decoding strategies and training corpora on the output.
We propose a simple yet effective method for language models to "detoxify" themselves without an additional large corpus or external discriminator.
arXiv Detail & Related papers (2022-03-06T23:55:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.