Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective
- URL: http://arxiv.org/abs/2409.10053v1
- Date: Mon, 16 Sep 2024 07:29:40 GMT
- Title: Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective
- Authors: Van-Cuong Pham, Thien Huu Nguyen
- Abstract summary: We propose a novel editing method that views activations in terms of their directions and magnitudes.
Our method, named Householder Pseudo-Rotation (HPR), mimics a rotation transformation, thus preserving activation norms and improving performance on various safety benchmarks.
- Score: 32.12158544745983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Activation Editing, which involves directly editing the internal representations of large language models (LLMs) to alter their behaviors and achieve desired properties, has emerged as a promising area of research. Existing works primarily treat LLMs' activations as points in space and modify them by adding steering vectors. However, this approach limits how much performance can improve while keeping activation magnitudes consistent. To overcome these issues, we propose a novel editing method that views activations in terms of their directions and magnitudes. Our method, named Householder Pseudo-Rotation (HPR), mimics a rotation transformation, thus preserving activation norms and improving performance on various safety benchmarks.
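The abstract does not spell out the update rule, but the underlying geometry is standard: a Householder reflection H = I - 2vv^T/(v^T v) is orthogonal, so applying it to an activation changes the direction while leaving the norm untouched. The NumPy sketch below illustrates only that property; the target direction `w` (standing in for a learned "desired behavior" direction) and the function name are assumptions, not the paper's implementation.

```python
import numpy as np

def householder_pseudo_rotate(a: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Redirect activation `a` toward direction `w` without changing ||a||."""
    u = a / np.linalg.norm(a)            # current direction of the activation
    w = w / np.linalg.norm(w)            # target direction (unit length)
    v = u - w                            # normal of the bisecting hyperplane
    if np.linalg.norm(v) < 1e-8:         # already aligned; nothing to do
        return a
    # Householder reflection H = I - 2 v v^T / (v^T v), applied without
    # materializing the d x d matrix. H is orthogonal, so ||H a|| == ||a||.
    return a - 2.0 * (v @ a) / (v @ v) * v

rng = np.random.default_rng(0)
a = rng.normal(size=8)                   # stand-in hidden-state activation
w = rng.normal(size=8)                   # stand-in "desired behavior" direction
edited = householder_pseudo_rotate(a, w)
print(np.linalg.norm(a), np.linalg.norm(edited))   # norms agree
```

Note that a single reflection sends the activation all the way to `w`; an actual editing method would presumably rotate only partway toward the target (hence "pseudo-rotation"), a detail this summary does not specify.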
Related papers
- Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors [8.761404991620285]
Activation intervention has emerged as an effective and economical method to modify the behavior of large language models (LLMs).
We propose Semantics-Adaptive Dynamic Intervention (SADI), a novel method that constructs a dynamic steering vector to intervene model activations at inference time.
Experimental results show that SADI outperforms established baselines by substantial margins, improving task performance without training (a rough sketch of the input-conditioned steering idea follows this entry).
arXiv Detail & Related papers (2024-10-16T06:58:49Z)
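The summary above does not give SADI's construction, so the sketch below shows only the general shape of an input-conditioned intervention: instead of adding a fixed offset, the steering direction is gated by how strongly the current activation already expresses it. The dot-product gate, `alpha`, and all names are illustrative assumptions.

```python
import numpy as np

def dynamic_steer(h: np.ndarray, delta: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Add a steering direction with an input-dependent scale.

    `delta` would typically be a difference-of-means direction computed
    from contrastive examples (an assumption). The gate makes the
    effective steering vector vary per input, unlike static addition.
    """
    gate = (h @ delta) / (delta @ delta + 1e-8)   # projection coefficient
    return h + alpha * gate * delta

h = np.array([0.5, -1.0, 2.0])
delta = np.array([1.0, 0.0, 1.0])
print(dynamic_steer(h, delta))
```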
- Activation-Descent Regularization for Input Optimization of ReLU Networks [14.855585146315741]
We present a new approach for input optimization of ReLU networks that explicitly takes into account the effect of changes in activation patterns.
We convert the discrete space of activation patterns into differentiable representations and propose regularization terms that improve each descent step.
Our experiments demonstrate the effectiveness of the proposed input-optimization methods for improving the state of the art in various areas (a soft-relaxation sketch follows this entry).
arXiv Detail & Related papers (2024-06-01T16:46:46Z)
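A ReLU network's activation pattern 1[x > 0] is piecewise constant in the input, so gradient descent on the input never "sees" pattern flips. One standard relaxation, sketched below, replaces the hard indicator with a temperature-controlled sigmoid; the surrogate choice, the penalty form, and `tau` are assumptions rather than the paper's exact regularizers.

```python
import numpy as np

def soft_pattern(pre_act: np.ndarray, tau: float = 0.1) -> np.ndarray:
    """Differentiable surrogate for the binary pattern 1[pre_act > 0]."""
    return 1.0 / (1.0 + np.exp(-pre_act / tau))

def pattern_change_penalty(pre_act: np.ndarray, pre_act_ref: np.ndarray,
                           tau: float = 0.1) -> float:
    """Penalize activation-pattern drift between the current input's
    pre-activations and a reference; a per-step regularizer like this
    makes pattern changes visible to input optimization."""
    diff = soft_pattern(pre_act, tau) - soft_pattern(pre_act_ref, tau)
    return float(np.sum(diff ** 2))

print(pattern_change_penalty(np.array([0.3, -0.2]), np.array([0.3, 0.2])))
```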
- One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs).
We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG (a toy sketch of the pluggable-token idea follows this entry).
arXiv Detail & Related papers (2024-05-30T03:44:54Z)
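The entry above gives no interface details; the sketch below only illustrates why "pluggable" is cheap: just a K x d block of embeddings is trained and prepended to the frozen model's input embeddings. The class and method names are hypothetical.

```python
import numpy as np

class VirtualTokens:
    """K trainable embedding rows prepended to a frozen LLM's inputs."""

    def __init__(self, k: int, d: int, seed: int = 0):
        # Only these k*d parameters are trained; the base model stays
        # frozen, so the tokens can be attached or removed at will.
        self.emb = np.random.default_rng(seed).normal(scale=0.02, size=(k, d))

    def prepend(self, input_embs: np.ndarray) -> np.ndarray:
        # input_embs: (seq_len, d) token embeddings of the RAG prompt
        return np.concatenate([self.emb, input_embs], axis=0)

vt = VirtualTokens(k=4, d=16)
prompt = np.zeros((10, 16))
print(vt.prepend(prompt).shape)   # (14, 16)
```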
"steering vectors" are extracted from the activations of human preference data.
This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization.
Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs (a toy preference-loss sketch follows this entry).
arXiv Detail & Related papers (2024-05-28T05:10:40Z)
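The objective is not reproduced in this summary; the toy loss below is modeled on DPO-style preference losses and shows only the "bi-directional" idea: adding the steering vector should raise the preferred response's probability, while subtracting it should raise the dispreferred one. The scalar inputs, `beta`, and the exact form are assumptions.

```python
import numpy as np

def logsigmoid(z: float) -> float:
    return float(-np.log1p(np.exp(-z)))

def bipo_style_loss(logp_pos_plus: float, logp_neg_plus: float,
                    logp_pos_minus: float, logp_neg_minus: float,
                    beta: float = 0.1) -> float:
    """Bi-directional preference loss for one contrastive pair.

    *_plus: log-prob of a response with +v added to the activations;
    *_minus: with -v. Training v lowers this loss when +v favors the
    preferred (pos) response and -v favors the dispreferred (neg) one.
    """
    forward = logsigmoid(beta * (logp_pos_plus - logp_neg_plus))
    backward = logsigmoid(beta * (logp_neg_minus - logp_pos_minus))
    return -(forward + backward)

print(bipo_style_loss(-1.0, -3.0, -3.5, -1.2))
```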
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while leaving downstream-task performance almost unchanged (a guard-style sketch follows this entry).
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
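Mechanism details are not in this summary; the sketch below shows one plausible guard shape consistent with the description: a probe direction extracted from an aligned guidance model flags harmful inputs, and only flagged activations are steered, which is why benign downstream behavior would be largely preserved. The linear-probe trigger and all names are assumptions.

```python
import numpy as np

def guarded_intervention(h: np.ndarray, harm_probe: np.ndarray,
                         safety_vec: np.ndarray,
                         threshold: float = 0.0, alpha: float = 1.0) -> np.ndarray:
    """Steer only when the activation looks harmful.

    `harm_probe` and `safety_vec` would come from an aligned guidance
    model (an assumption). Benign inputs pass through untouched, keeping
    downstream task behavior intact.
    """
    score = (h @ harm_probe) / (np.linalg.norm(harm_probe) + 1e-8)
    if score > threshold:                 # flagged: push toward refusal
        return h + alpha * safety_vec
    return h                              # benign: leave unedited

h = np.array([1.0, 2.0, -0.5])
print(guarded_intervention(h, np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])))
```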
- Audio-Adaptive Activity Recognition Across Video Domains [112.46638682143065]
We leverage activity sounds for domain adaptation as they have less variance across domains and can reliably indicate which activities are not happening.
We propose an audio-adaptive encoder and associated learning methods that discriminatively adjust the visual feature representation.
We also introduce the new task of actor shift, with a corresponding audio-visual dataset, to challenge our method with situations where the activity appearance changes dramatically.
arXiv Detail & Related papers (2022-03-27T08:15:20Z)
- Transformers in Action: Weakly Supervised Action Segmentation [81.18941007536468]
We show how to apply transformers to improve action alignment accuracy over the equivalent RNN-based models.
We also propose a supplemental transcript embedding approach to select transcripts more quickly at inference time.
We evaluate our proposed methods across the benchmark datasets to better understand the applicability of transformers.
arXiv Detail & Related papers (2022-01-14T21:15:58Z)
- Activation Modulation and Recalibration Scheme for Weakly Supervised Semantic Segmentation [24.08326440298189]
We propose a novel activation modulation and recalibration (AMR) scheme for weakly supervised semantic segmentation.
We show that AMR establishes a new state-of-the-art performance on the PASCAL VOC 2012 dataset.
Experiments also reveal that our scheme is plug-and-play and can be incorporated with other approaches to boost their performance.
arXiv Detail & Related papers (2021-12-16T16:26:14Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it to correlated actions, and combine these critic-estimated action values to control the variance of gradient estimation.
These efforts yield a new discrete-action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance-control techniques (a toy critic-baseline sketch follows this entry).
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
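As a rough illustration of the variance-control idea (the paper's estimator over correlated actions is more elaborate), the sketch below centers a score-function policy gradient with the critic's expected Q-value as a baseline; the per-logit gradient uses the standard softmax-policy identity. Names and shapes are illustrative assumptions.

```python
import numpy as np

def critic_baseline_grad(logits: np.ndarray, action: int,
                         q_values: np.ndarray) -> np.ndarray:
    """Variance-reduced policy gradient w.r.t. the logits for one step.

    Uses the critic's Q-estimates over *all* actions to form the baseline
    E_pi[Q(s, .)], so the advantage is centred before it multiplies the
    score d log pi(a|s)/d logits = onehot(a) - pi.
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    baseline = probs @ q_values            # expected Q under the policy
    advantage = q_values[action] - baseline
    score = -probs                         # gradient of log softmax ...
    score[action] += 1.0                   # ... plus the one-hot term
    return advantage * score

print(critic_baseline_grad(np.array([0.1, 0.5, -0.2]), 1,
                           np.array([1.0, 2.0, 0.5])))
```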