Let the Models Respond: Interpreting Language Model Detoxification
Through the Lens of Prompt Dependence
- URL: http://arxiv.org/abs/2309.00751v1
- Date: Fri, 1 Sep 2023 22:26:06 GMT
- Title: Let the Models Respond: Interpreting Language Model Detoxification
Through the Lens of Prompt Dependence
- Authors: Daniel Scalena, Gabriele Sarti, Malvina Nissim, Elisabetta Fersini
- Abstract summary: We apply popular detoxification approaches to several language models and quantify their impact on the resulting models' prompt dependence.
We evaluate the effectiveness of counter-narrative fine-tuning and compare it with reinforcement learning-driven detoxification.
- Score: 15.084940396969
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to language models' propensity to generate toxic or hateful responses,
several techniques were developed to align model generations with users'
preferences. Despite the effectiveness of such methods in improving the safety
of model interactions, their impact on models' internal processes is still
poorly understood. In this work, we apply popular detoxification approaches to
several language models and quantify their impact on the resulting models'
prompt dependence using feature attribution methods. We evaluate the
effectiveness of counter-narrative fine-tuning and compare it with
reinforcement learning-driven detoxification, observing differences in prompt
reliance between the two methods despite their similar detoxification
performances.
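For intuition, the sketch below scores how strongly one generated token depends on the prompt using a simple gradient-based saliency attribution; the model choice, attribution method, and aggregation here are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: gradient saliency of one generated token w.r.t. the prompt.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The following statement is", return_tensors="pt").input_ids

# Greedily pick the next token, then differentiate its logit w.r.t. the
# prompt token embeddings.
with torch.no_grad():
    next_id = model(input_ids=ids).logits[0, -1].argmax()

embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds).logits
logits[0, -1, next_id].backward()

# One attribution score per prompt token; comparing such prompt-reliance
# profiles before and after detoxification is the spirit of the analysis.
scores = embeds.grad[0].norm(dim=-1)
for token, s in zip(tok.convert_ids_to_tokens(ids[0]), scores / scores.sum()):
    print(f"{token:>12s} {s.item():.3f}")
```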
Related papers
- The effect of fine-tuning on language model toxicity [7.539523407936451]
Fine-tuning language models has become increasingly popular following the proliferation of open models.
We assess how fine-tuning can impact different open models' propensity to output toxic content.
We show that small amounts of parameter-efficient fine-tuning of developer-tuned models via low-rank adaptation can significantly alter their propensity to output toxic content.
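A minimal sketch of the kind of low-rank adaptation (LoRA) setup studied here; the rank, alpha, target modules, and base model are illustrative assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Wrap a base model with LoRA adapters; only the adapters are trained.
base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(base, config)
model.print_trainable_parameters()  # a tiny fraction of the weights trains
```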
arXiv Detail & Related papers (2024-10-21T09:39:09Z)
- Representation Surgery: Theory and Practice of Affine Steering [72.61363182652853]
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text.
One natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations.
This paper investigates the formal and empirical properties of steering functions.
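A minimal sketch of affine steering at inference time: one layer's hidden states h are replaced by Wh + b via a forward hook. The identity/zero placeholders and the layer choice are assumptions; the paper studies how to choose such maps in a principled way.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
d = model.config.hidden_size
W, b = torch.eye(d), torch.zeros(d)  # a fitted steering map would go here

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0]
    return (hidden @ W.T + b,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(steer)
ids = tok("The movie was", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=10)[0]))
handle.remove()
```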
arXiv Detail & Related papers (2024-02-15T00:20:30Z)
- Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models [100.53662473219806]
Diffusion-of-Thought (DoT) is a novel approach that integrates diffusion models with Chain-of-Thought.
DoT allows reasoning steps to diffuse over time through a diffusion language model.
Our results demonstrate the effectiveness of DoT in multi-digit multiplication, logic, and grade school math problems.
arXiv Detail & Related papers (2024-02-12T16:23:28Z) - Revisiting Demonstration Selection Strategies in In-Context Learning [66.11652803887284]
Large language models (LLMs) have shown an impressive ability to perform a wide range of tasks using in-context learning (ICL), yet their performance varies considerably with the chosen demonstrations.
In this work, we first revisit the factors contributing to this variance from both the data and the model side, and find that the choice of demonstration is both data- and model-dependent.
We propose a data- and model-dependent demonstration selection method, TopK + ConE, based on the assumption that the performance of a demonstration positively correlates with its contribution to the model's understanding of the test samples.
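A minimal sketch of the retrieval (TopK) stage of such a selection method: rank pool examples by cosine similarity to the test input. The model-dependent ConE re-ranking stage and the sentence encoder are omitted here; the embeddings below are random stand-ins.

```python
import numpy as np

def top_k_demos(query_emb, pool_embs, k=4):
    # Cosine similarity between the test input and each candidate demo.
    pool = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    query = query_emb / np.linalg.norm(query_emb)
    return np.argsort(-(pool @ query))[:k]  # indices of the k nearest demos

rng = np.random.default_rng(0)
pool = rng.standard_normal((100, 384))  # stand-in demonstration embeddings
print(top_k_demos(rng.standard_normal(384), pool))
```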
arXiv Detail & Related papers (2024-01-22T16:25:27Z) - Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets [46.19529338280716]
Language models, characterized by their black-box nature, often hallucinate and display sensitivity to input perturbations.
We introduce a methodology designed to examine how input perturbations affect language models across various scales.
We present three distinct fine-tuning strategies to address robustness against multiple perturbations.
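For illustration, one simple input perturbation of the kind such robustness probes apply: random swaps of adjacent characters. The paper's challenge sets cover many more perturbation types and scales.

```python
import random

def char_swap(text: str, rate: float = 0.1, seed: int = 0) -> str:
    # Swap adjacent alphabetic characters with probability `rate`.
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(char_swap("Language models are sensitive to input perturbations."))
```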
We study the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models.
We show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning.
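A minimal sketch of the prompt-tuning setting compared here: only a handful of soft-prompt embeddings is trained while the multilingual model stays frozen. The base model and the number of virtual tokens are illustrative assumptions.

```python
from peft import PromptTuningConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Attach trainable virtual tokens to a frozen multilingual decoder.
base = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
config = PromptTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=16)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the soft-prompt embeddings train
```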
arXiv Detail & Related papers (2023-11-14T00:43:33Z)
- CMD: a framework for Context-aware Model self-Detoxification [22.842468869653818]
Text detoxification aims to minimize the risk of language models producing toxic content.
Existing detoxification methods fail to achieve a decent balance between detoxification effectiveness and generation quality.
We introduce a Context-aware Model self-Detoxification (CMD) framework that attends to both the context and the detoxification process.
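A minimal two-stage sketch in the spirit of context-aware self-detoxification: the model first rewrites the potentially toxic context, then generates from the sanitized version. The instruction format and single-model setup are assumptions, not the paper's exact pipeline.

```python
from transformers import pipeline

gen = pipeline("text-generation", model="gpt2")

# Stage 1: ask the model to sanitize the context itself.
context = "Those people are awful and"
rewrite = gen(f"Rewrite politely: {context}\nPolite version:",
              max_new_tokens=20)[0]["generated_text"]
sanitized = rewrite.split("Polite version:")[-1].strip()

# Stage 2: generate the actual response from the sanitized context.
print(gen(sanitized, max_new_tokens=30)[0]["generated_text"])
```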
arXiv Detail & Related papers (2023-08-16T11:50:38Z)
- Are Neural Topic Models Broken? [81.15470302729638]
We study the relationship between automated and human evaluation of topic models.
We find that neural topic models fare worse in both respects compared to an established classical method.
arXiv Detail & Related papers (2022-10-28T14:38:50Z)
- Detoxifying Language Models with a Toxic Corpus [16.7345472998388]
We propose to use a toxic corpus as an additional resource for reducing toxicity.
Our results show that a toxic corpus can indeed substantially reduce the toxicity of the language generation process.
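One established way to exploit a toxic corpus is to train an "anti-expert" LM on it and subtract its logits at decoding time (in the style of DExperts); whether this matches the paper's exact recipe is an assumption, and the untuned second model below is only a stand-in anti-expert.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")
anti = AutoModelForCausalLM.from_pretrained("gpt2")  # would be tuned on toxic text

ids = tok("You are such a", return_tensors="pt").input_ids
alpha = 2.0  # strength of the anti-toxic correction
for _ in range(10):
    with torch.no_grad():
        # Steer away from whatever the anti-expert finds likely.
        logits = base(ids).logits[0, -1] - alpha * anti(ids).logits[0, -1]
    ids = torch.cat([ids, logits.argmax().view(1, 1)], dim=1)
print(tok.decode(ids[0]))
```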
arXiv Detail & Related papers (2022-04-30T18:25:18Z)
- Reward Modeling for Mitigating Toxicity in Transformer-based Language Models [0.0]
Transformer-based language models are able to generate fluent text and be efficiently adapted across various natural language generation tasks.
Language models pretrained on large unlabeled web-text corpora have been shown to generate toxic content and exhibit social biases.
We propose Reinforce-Detoxify, a reinforcement-learning-based method for mitigating toxicity in language models.
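A minimal sketch of the RL idea: sample a continuation, score it with a toxicity reward, and apply a REINFORCE update toward low-toxicity text. The toy keyword reward stands in for a learned reward model, and the method's baseline and regularization terms are omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.Adam(model.parameters(), lr=1e-5)

def reward(text: str) -> float:
    return -1.0 if "hate" in text.lower() else 1.0  # placeholder classifier

prompt = tok("People online can be", return_tensors="pt").input_ids
out = model.generate(prompt, do_sample=True, max_new_tokens=12,
                     pad_token_id=tok.eos_token_id)
r = reward(tok.decode(out[0]))

# Log-probabilities of each sampled token under the current policy.
logp = torch.log_softmax(model(out).logits[:, :-1], dim=-1)
tok_logp = logp.gather(2, out[:, 1:, None]).squeeze(-1)

# REINFORCE: scale the continuation's log-likelihood by the reward.
loss = -r * tok_logp[:, prompt.shape[1] - 1:].sum()
loss.backward()
opt.step()
opt.zero_grad()
```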
arXiv Detail & Related papers (2022-02-19T19:26:22Z)