Risk-Aware Distributional Intervention Policies for Language Models
- URL: http://arxiv.org/abs/2501.15758v1
- Date: Mon, 27 Jan 2025 04:00:38 GMT
- Title: Risk-Aware Distributional Intervention Policies for Language Models
- Authors: Bao Nguyen, Binh Nguyen, Duy Nguyen, Viet Anh Nguyen,
- Abstract summary: Language models are prone to occasionally undesirable generations, such as harmful or toxic content.
This paper presents a new two-stage approach to detect and mitigate undesirable content generations.
- Score: 15.027122089807053
- License:
- Abstract: Language models are prone to occasionally undesirable generations, such as harmful or toxic content, despite their impressive capability to produce texts that appear accurate and coherent. This paper presents a new two-stage approach to detect and mitigate undesirable content generations by rectifying activations. First, we train an ensemble of layerwise classifiers to detect undesirable content using activations by minimizing a smooth surrogate of the risk-aware score. Then, for contents that are detected as undesirable, we propose layerwise distributional intervention policies that perturb the attention heads minimally while guaranteeing probabilistically the effectiveness of the intervention. Benchmarks on several language models and datasets show that our method outperforms baselines in reducing the generation of undesirable output.
Related papers
- Probe-Free Low-Rank Activation Intervention [26.502232859901167]
Inference-time interventions that edit the hidden activations have shown promising results in steering the LMs towards desirable generations.
This paper proposes a probe-free intervention method FLORAIN for all attention heads in a specific activation layer.
arXiv Detail & Related papers (2025-02-06T13:03:05Z) - Attacks against Abstractive Text Summarization Models through Lead Bias and Influence Functions [1.7863534204867277]
Large Language Models are vulnerable to adversarial perturbations and data poisoning attacks.
In this work, we unveil a novel approach by exploiting the inherent lead bias in summarization models.
We also introduce an innovative application of influence functions, to execute data poisoning, which compromises the model's integrity.
arXiv Detail & Related papers (2024-10-26T00:35:15Z) - Linearly Controlled Language Generation with Performative Guarantees [9.487387238674721]
We use a common model of concept semantics as linearly represented in an LM's latent space.
We propose a lightweight, gradient-free intervention that dynamically steers trajectories away from regions corresponding to undesired meanings.
arXiv Detail & Related papers (2024-05-24T11:30:44Z) - DPP-Based Adversarial Prompt Searching for Lanugage Models [56.73828162194457]
Auto-regressive Selective Replacement Ascent (ASRA) is a discrete optimization algorithm that selects prompts based on both quality and similarity with determinantal point process (DPP)
Experimental results on six different pre-trained language models demonstrate the efficacy of ASRA for eliciting toxic content.
arXiv Detail & Related papers (2024-03-01T05:28:06Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - Self-Detoxifying Language Models via Toxification Reversal [11.238212967733165]
Language model detoxification aims to minimize the risk of generating offensive or harmful content in pretrained language models (PLMs)
We propose a more lightweight approach that enables the PLM itself to achieve "self-detoxification"
Our method is built upon the observation that prepending a negative steering prompt can effectively induce PLMs to generate toxic content.
arXiv Detail & Related papers (2023-10-14T12:51:38Z) - Tailoring Language Generation Models under Total Variation Distance [55.89964205594829]
The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimizing method.
We develop practical bounds to apply it to language generation.
We introduce the TaiLr objective that balances the tradeoff of estimating TVD.
arXiv Detail & Related papers (2023-02-26T16:32:52Z) - Sentence Representation Learning with Generative Objective rather than
Contrastive Objective [86.01683892956144]
We propose a novel generative self-supervised learning objective based on phrase reconstruction.
Our generative learning achieves powerful enough performance improvement and outperforms the current state-of-the-art contrastive methods.
arXiv Detail & Related papers (2022-10-16T07:47:46Z) - A Simple Contrastive Learning Objective for Alleviating Neural Text
Degeneration [56.64703901898937]
We propose a new contrastive token learning objective that inherits the advantages of cross-entropy and unlikelihood training.
Comprehensive experiments on language modeling and open-domain dialogue generation tasks show that the proposed contrastive token objective yields less repetitive texts.
arXiv Detail & Related papers (2022-05-05T08:50:50Z) - Contextualized Perturbation for Textual Adversarial Attack [56.370304308573274]
Adversarial examples expose the vulnerabilities of natural language processing (NLP) models.
This paper presents CLARE, a ContextuaLized AdversaRial Example generation model that produces fluent and grammatical outputs.
arXiv Detail & Related papers (2020-09-16T06:53:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.