Probe-Free Low-Rank Activation Intervention
- URL: http://arxiv.org/abs/2502.04043v1
- Date: Thu, 06 Feb 2025 13:03:05 GMT
- Title: Probe-Free Low-Rank Activation Intervention
- Authors: Chonghe Jiang, Bao Nguyen, Anthony Man-Cho So, Viet Anh Nguyen
- Abstract summary: Inference-time interventions that edit the hidden activations have shown promising results in steering the LMs towards desirable generations.
This paper proposes FLORAIN, a probe-free intervention method for all attention heads in a specific activation layer.
- Score: 26.502232859901167
- License:
- Abstract: Language models (LMs) can produce texts that appear accurate and coherent but contain untruthful or toxic content. Inference-time interventions that edit the hidden activations have shown promising results in steering LMs towards desirable generations. Existing activation intervention methods often pair an activation probe that detects undesirable generation with an activation modification that steers subsequent generation. This paper proposes FLORAIN, a probe-free intervention method for all attention heads in a specific activation layer. It eliminates the need to train classifiers for probing purposes. The intervention function is parametrized by a sample-wise nonlinear low-rank mapping, which is trained by minimizing the distance between the modified activations and their projection onto the manifold of desirable content. Under specific constructions of the manifold and projection distance, we show that the intervention strategy can be computed efficiently by solving a smooth optimization problem. The empirical results, benchmarked on multiple base models, demonstrate that FLORAIN consistently outperforms several baseline methods in enhancing model truthfulness and quality across generation and multiple-choice tasks.
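As a concrete illustration of the recipe the abstract describes, here is a minimal PyTorch sketch: a rank-r intervention applied to a layer's activations, trained probe-free by pulling edited activations toward a projection onto a "desirable" region. The linear-plus-bias map, the ball-shaped stand-in for the manifold, and all dimensions are illustrative assumptions, not FLORAIN's exact construction (the paper's mapping is nonlinear and sample-wise).

```python
import torch
import torch.nn as nn

class LowRankIntervention(nn.Module):
    """Hypothetical low-rank activation edit: h -> h + U V^T h + b.
    FLORAIN's mapping is nonlinear and sample-wise; this linear variant
    only illustrates the low-rank structure."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.U = nn.Parameter(torch.randn(dim, rank) * 0.01)
        self.V = nn.Parameter(torch.randn(dim, rank) * 0.01)
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, dim) concatenated head activations of one layer
        return h + (h @ self.V) @ self.U.T + self.b

def project_to_ball(h: torch.Tensor, mu: torch.Tensor, radius: float = 1.0) -> torch.Tensor:
    """Assumed stand-in for the paper's manifold projection: project onto a
    Euclidean ball around the mean of 'desirable' activations."""
    d = h - mu
    n = d.norm(dim=-1, keepdim=True).clamp(min=radius)
    return mu + d * (radius / n)

# Probe-free training: no classifier -- just pull edited activations
# toward their projection on the desirable region. Dimensions are placeholders.
f = LowRankIntervention(dim=4096)
opt = torch.optim.Adam(f.parameters(), lr=1e-3)
h = torch.randn(8, 4096)    # activations from the chosen layer
mu = torch.zeros(4096)      # mean of desirable activations (assumed given)
edited = f(h)
loss = ((edited - project_to_ball(edited, mu)) ** 2).sum(-1).mean()
opt.zero_grad(); loss.backward(); opt.step()
```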
Related papers
- Task-driven Layerwise Additive Activation Intervention [12.152228552335798]
Modern language models (LMs) have significantly advanced generative modeling in natural language processing (NLP).
This paper proposes a layer-wise additive activation intervention framework that optimizes the intervention process.
We benchmark our framework on various datasets, demonstrating accuracy improvements over pre-trained LMs and competing intervention baselines.
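A minimal sketch of what a layer-wise additive intervention can look like in PyTorch, assuming the intervention is a single learned offset attached via a forward hook; the paper's task-driven optimization of the offsets is not reproduced here.

```python
import torch
import torch.nn as nn

def attach_additive_intervention(layer: nn.Module, dim: int) -> nn.Parameter:
    """Attach a learnable additive offset to a layer's hidden states via a
    forward hook. Which layer to edit and how to train `delta` (against a
    task loss) are left open; both are assumptions here."""
    delta = nn.Parameter(torch.zeros(dim))

    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the layer's output.
        # Note: many transformer blocks return tuples; unpack accordingly.
        return output + delta

    layer.register_forward_hook(hook)
    return delta  # optimize with e.g. torch.optim.Adam([delta], lr=1e-3)
```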
arXiv Detail & Related papers (2025-02-10T02:49:46Z) - Joint Localization and Activation Editing for Low-Resource Fine-Tuning [73.64004083269424]
We propose a joint localization and activation editing (JoLA) method.
JoLA learns (1) which heads in the Transformer to edit, (2) whether the intervention should be additive, multiplicative, or both, and (3) the intervention parameters themselves.
Through evaluations on three benchmarks spanning commonsense reasoning, natural language understanding, and natural language generation, we demonstrate that JoLA consistently outperforms existing methods.
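A hedged sketch of the kind of gated additive-plus-multiplicative head edit the summary describes; the gate here is a plain per-head sigmoid, whereas JoLA's actual localization mechanism for selecting heads may differ.

```python
import torch
import torch.nn as nn

class HeadEdit(nn.Module):
    """Sketch of a JoLA-style per-head edit: a gate g decides whether a head
    is edited at all, and the edit combines a multiplicative and an additive
    term. Illustrative only; not JoLA's exact parametrization."""
    def __init__(self, n_heads: int, head_dim: int):
        super().__init__()
        self.gate_logit = nn.Parameter(torch.full((n_heads, 1), -2.0))  # start mostly "off"
        self.mult = nn.Parameter(torch.ones(n_heads, head_dim))         # multiplicative edit
        self.add = nn.Parameter(torch.zeros(n_heads, head_dim))         # additive edit

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, n_heads, head_dim) per-head attention outputs
        g = torch.sigmoid(self.gate_logit)   # soft head-selection gate
        edited = self.mult * h + self.add
        return g * edited + (1 - g) * h
```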
arXiv Detail & Related papers (2025-02-03T09:13:09Z) - Risk-Aware Distributional Intervention Policies for Language Models [15.027122089807053]
Language models occasionally produce undesirable generations, such as harmful or toxic content.
This paper presents a new two-stage approach to detect and mitigate undesirable content generations.
arXiv Detail & Related papers (2025-01-27T04:00:38Z) - Exploring Embedding Priors in Prompt-Tuning for Improved Interpretability and Control [0.0]
We investigate how crucial the phenomenon of embedding collapse, frequently observed in Prompt-Tuning, is for the final performance of the model.
Our findings suggest that priors strongly affect the position of the tuned embeddings, and models can effectively work with embeddings from different parts of activation spaces.
arXiv Detail & Related papers (2024-12-24T18:18:52Z) - Language Rectified Flow: Advancing Diffusion Language Generation with Probabilistic Flows [53.31856123113228]
This paper proposes Language Rectified Flow.
The method is based on a reformulation of standard probabilistic flow models.
Experiments and ablation studies demonstrate that our method can be general, effective, and beneficial for many NLP tasks.
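Rectified flow itself has a compact generic training objective: regress a velocity field on straight-line interpolations between noise and data. A sketch of that generic objective (not this paper's language-specific formulation):

```python
import torch

def rectified_flow_loss(v_net, x1: torch.Tensor) -> torch.Tensor:
    """Generic rectified-flow objective: with x0 ~ N(0, I) and
    x_t = (1 - t) * x0 + t * x1, train v_net(x_t, t) to predict x1 - x0.
    `v_net` is any network taking (x_t, t); its design is left open."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1 - t) * x0 + t * x1
    return ((v_net(xt, t) - (x1 - x0)) ** 2).mean()
```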
arXiv Detail & Related papers (2024-03-25T17:58:22Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z) - VRA: Variational Rectified Activation for Out-of-distribution Detection [45.804178022641764]
Out-of-distribution (OOD) detection is critical to building reliable machine learning systems in the open world.
ReAct, a typical and effective technique for dealing with model overconfidence, truncates high activations to increase the gap between in-distribution and OOD data.
We propose a novel technique called Variational Rectified Activation (VRA), which simulates these suppression and amplification operations using piecewise functions.
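For concreteness, here is a sketch contrasting ReAct's truncation with a VRA-style piecewise rectification; the thresholds and shift below are hand-picked placeholders, whereas the paper derives them from a variational objective.

```python
import torch

def react(z: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """ReAct: truncate high activations at a cap c."""
    return torch.clamp(z, max=c)

def vra(z: torch.Tensor, alpha: float = 0.5, beta: float = 1.5,
        gamma: float = 0.3) -> torch.Tensor:
    """VRA-style piecewise rectification: suppress very low activations,
    amplify mid-range ones, and truncate very high ones. Thresholds here
    are illustrative placeholders."""
    out = torch.zeros_like(z)                       # z < alpha: suppress to 0
    mid = (z >= alpha) & (z <= beta)
    out = torch.where(mid, z + gamma, out)          # alpha <= z <= beta: amplify
    out = torch.where(z > beta, torch.full_like(z, beta), out)  # z > beta: truncate
    return out
```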
arXiv Detail & Related papers (2023-02-23T00:45:14Z) - Active Learning for Optimal Intervention Design in Causal Models [11.294389953686945]
We develop a causal active learning strategy to identify interventions that are optimal, as measured by the discrepancy between the post-interventional mean of the distribution and a desired target mean.
We apply our approach to both synthetic data and single-cell transcriptomic data from Perturb-CITE-seq experiments to identify optimal perturbations that induce a specific cell state transition.
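The acquisition criterion described above can be sketched generically: score each candidate intervention by how far its estimated post-interventional mean lies from the target. The uncertainty-aware machinery of the actual method is omitted here.

```python
import numpy as np

def select_intervention(candidate_means: np.ndarray, target: np.ndarray) -> int:
    """Pick the candidate intervention whose estimated post-interventional
    mean is closest to the desired target mean (squared Euclidean distance).
    candidate_means: (n_candidates, dim) estimates of E[X | do(a_i)] from the
    current causal model; target: (dim,) desired mean. Illustrative only:
    the paper's strategy also accounts for model uncertainty."""
    gaps = ((candidate_means - target) ** 2).sum(axis=1)
    return int(np.argmin(gaps))
```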
arXiv Detail & Related papers (2022-09-10T20:40:30Z) - Training Discrete Deep Generative Models via Gapped Straight-Through Estimator [72.71398034617607]
We propose a Gapped Straight-Through (GST) estimator to reduce the variance without incurring resampling overhead.
This estimator is inspired by the essential properties of Straight-Through Gumbel-Softmax.
Experiments demonstrate that the proposed GST estimator enjoys better performance compared to strong baselines on two discrete deep generative modeling tasks.
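For context, here is the standard Straight-Through Gumbel-Softmax estimator that GST is inspired by; GST itself replaces the stochastic perturbation with a deterministic "gapped" one, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def st_gumbel_softmax(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Straight-Through Gumbel-Softmax: discrete one-hot sample on the
    forward pass, continuous softmax gradient on the backward pass."""
    u = torch.rand_like(logits).clamp_min(1e-20)
    gumbel = -torch.log(-torch.log(u))
    soft = F.softmax((logits + gumbel) / tau, dim=-1)
    hard = F.one_hot(soft.argmax(dim=-1), logits.shape[-1]).to(soft.dtype)
    return hard + soft - soft.detach()  # forward: hard; backward: d(soft)
```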
arXiv Detail & Related papers (2022-06-15T01:46:05Z) - MCDAL: Maximum Classifier Discrepancy for Active Learning [74.73133545019877]
Recent state-of-the-art active learning methods have mostly leveraged Generative Adversarial Networks (GAN) for sample acquisition.
We propose in this paper a novel active learning framework that we call Maximum Classifier Discrepancy for Active Learning (MCDAL).
In particular, we utilize two auxiliary classification layers that learn tighter decision boundaries by maximizing the discrepancies among them.
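A hedged sketch of the acquisition rule implied by the summary: score unlabeled samples by the discrepancy between two auxiliary classifier heads and query the most discrepant ones. The discrepancy metric and head design below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mcdal_scores(feats: torch.Tensor, head1: torch.nn.Module,
                 head2: torch.nn.Module) -> torch.Tensor:
    """Score unlabeled samples by the L1 discrepancy between two auxiliary
    classifier heads trained to disagree; higher score = more informative."""
    p1 = F.softmax(head1(feats), dim=-1)
    p2 = F.softmax(head2(feats), dim=-1)
    return (p1 - p2).abs().sum(dim=-1)

# Query the top-k most discrepant unlabeled samples:
# idx = mcdal_scores(feats, head1, head2).topk(k).indices
```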
arXiv Detail & Related papers (2021-07-23T06:57:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.