Related papers: Let the Models Respond: Interpreting Language Model Detoxification Through the Lens of Prompt Dependence

URL: http://arxiv.org/abs/2309.00751v1
Date: Fri, 1 Sep 2023 22:26:06 GMT
Title: Let the Models Respond: Interpreting Language Model Detoxification Through the Lens of Prompt Dependence
Authors: Daniel Scalena, Gabriele Sarti, Malvina Nissim, Elisabetta Fersini
Abstract summary: We apply popular detoxification approaches to several language models and quantify their impact on the resulting models' prompt dependence. We evaluate the effectiveness of counter-narrative fine-tuning and compare it with reinforcement learning-driven detoxification.
Score: 15.084940396969
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Due to language models' propensity to generate toxic or hateful responses, several techniques were developed to align model generations with users' preferences. Despite the effectiveness of such methods in improving the safety of model interactions, their impact on models' internal processes is still poorly understood. In this work, we apply popular detoxification approaches to several language models and quantify their impact on the resulting models' prompt dependence using feature attribution methods. We evaluate the effectiveness of counter-narrative fine-tuning and compare it with reinforcement learning-driven detoxification, observing differences in prompt reliance between the two methods despite their similar detoxification performances.

Related papers

The effect of fine-tuning on language model toxicity [7.539523407936451]
Fine-tuning language models has become increasingly popular following the proliferation of open models. We assess how fine-tuning can impact different open models' propensity to output toxic content. We show that small amounts of parameter-efficient fine-tuning on developer-tuned models via low-rank adaptation can significantly alter these results.
arXiv Detail & Related papers (2024-10-21T09:39:09Z)
Representation Surgery: Theory and Practice of Affine Steering [72.61363182652853]
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. One natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations. This paper investigates the formal and empirical properties of steering functions.
arXiv Detail & Related papers (2024-02-15T00:20:30Z)
Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models [100.53662473219806]
Diffusion-of-Thought (DoT) is a novel approach that integrates diffusion models with Chain-of-Thought. DoT allows reasoning steps to diffuse over time through a diffusion language model. Our results demonstrate the effectiveness of DoT in multi-digit multiplication, logic, and grade school math problems.
arXiv Detail & Related papers (2024-02-12T16:23:28Z)
Revisiting Demonstration Selection Strategies in In-Context Learning [66.11652803887284]
Large language models (LLMs) have shown an impressive ability to perform a wide range of tasks using in-context learning (ICL). In this work, we first revisit the factors contributing to this variance from both data and model aspects, and find that the choice of demonstration is both data- and model-dependent. We propose a data- and model-dependent demonstration selection method, TopK + ConE, based on the assumption that the performance of a demonstration positively correlates with its contribution to the model's understanding of the test samples.
arXiv Detail & Related papers (2024-01-22T16:25:27Z)
Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets [46.19529338280716]
Language models, characterized by their black-box nature, often hallucinate and display sensitivity to input perturbations. We introduce a methodology designed to examine how input perturbations affect language models across various scales. We present three distinct fine-tuning strategies to address robustness against multiple perturbations.
arXiv Detail & Related papers (2023-11-15T02:59:10Z)
On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based Multilingual Model [49.81429697921861]
We study the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models. We show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning.
arXiv Detail & Related papers (2023-11-14T00:43:33Z)
CMD: a framework for Context-aware Model self-Detoxification [22.842468869653818]
Text detoxification aims to minimize the risk of language models producing toxic content. Existing detoxification methods fail to achieve a decent balance between detoxification effectiveness and generation quality. We introduce a Context-aware Model self-Detoxification (CMD) framework that pays attention to both the context and the detoxification process.
arXiv Detail & Related papers (2023-08-16T11:50:38Z)
Are Neural Topic Models Broken? [81.15470302729638]
We study the relationship between automated and human evaluation of topic models. We find that neural topic models fare worse in both respects compared to an established classical method.
arXiv Detail & Related papers (2022-10-28T14:38:50Z)
Detoxifying Language Models with a Toxic Corpus [16.7345472998388]
We propose using a toxic corpus as an additional resource to reduce toxicity. Our results show that a toxic corpus can indeed help to substantially reduce the toxicity of the language generation process.
arXiv Detail & Related papers (2022-04-30T18:25:18Z)
Reward Modeling for Mitigating Toxicity in Transformer-based Language Models [0.0]
Transformer-based language models are able to generate fluent text and be efficiently adapted across various natural language generation tasks. Language models that are pretrained on large unlabeled web text corpora have been shown to suffer from degenerating toxic content and social bias behaviors. We propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models.
arXiv Detail & Related papers (2022-02-19T19:26:22Z)
ToxCCIn: Toxic Content Classification with Interpretability [16.153683223016973]
Explanations are important for tasks like offensive language or toxicity detection on social media. We propose a technique to improve the interpretability of transformer models, based on a simple and powerful assumption. We find that this approach is effective and can produce explanations that exceed the quality of those provided by Logistic Regression analysis.
arXiv Detail & Related papers (2021-03-01T22:17:10Z)
Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling [54.94763543386523]
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors. We present a novel multi-stage modeling approach where the disentangled factors are first learned using a penalty-based disentangled representation learning method. Then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables.
arXiv Detail & Related papers (2020-10-25T18:51:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.