Steering Llama 2 via Contrastive Activation Addition
- URL: http://arxiv.org/abs/2312.06681v4
- Date: Fri, 5 Jul 2024 15:30:45 GMT
- Title: Steering Llama 2 via Contrastive Activation Addition
- Authors: Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner
- Abstract summary: Contrastive Activation Addition (CAA) is a method for steering language models by modifying their activations during forward passes.
CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
- Score: 41.54815073311959
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA's mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
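The two steps the abstract describes (averaging activation differences over contrastive pairs, then adding the scaled vector at every position after the user's prompt) can be sketched with plain NumPy arrays. This is a minimal illustration, not the authors' implementation: the function names `compute_steering_vector` and `apply_steering`, the array shapes, and the toy data are all assumptions, and in a real model the activations would be read from and written back to the residual stream of a chosen layer.

```python
import numpy as np


def compute_steering_vector(pos_acts, neg_acts):
    """Mean difference between paired activations (hypothetical helper).

    pos_acts, neg_acts: arrays of shape (n_pairs, d_model), where row i holds
    the residual-stream activation for the positive / negative example of pair i.
    """
    pos_acts = np.asarray(pos_acts, dtype=float)
    neg_acts = np.asarray(neg_acts, dtype=float)
    if pos_acts.shape != neg_acts.shape:
        raise ValueError("positive and negative activations must be paired")
    # Average the per-pair differences over all pairs.
    return (pos_acts - neg_acts).mean(axis=0)


def apply_steering(resid, vector, coeff, prompt_len):
    """Add coeff * vector at every token position after the prompt.

    resid: array of shape (seq_len, d_model) for one sequence.
    prompt_len: number of leading positions (the user's prompt) left untouched.
    A negative coeff steers away from the behavior.
    """
    steered = np.array(resid, dtype=float, copy=True)  # do not modify the input
    steered[prompt_len:] += coeff * vector
    return steered


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model = 8
    # Toy stand-ins for activations collected from contrastive prompt pairs.
    pos = rng.normal(size=(16, d_model)) + 1.0
    neg = rng.normal(size=(16, d_model))
    vec = compute_steering_vector(pos, neg)

    resid = rng.normal(size=(10, d_model))  # 10 tokens, first 6 are the prompt
    steered = apply_steering(resid, vec, coeff=2.0, prompt_len=6)
    print("prompt positions unchanged:", np.allclose(steered[:6], resid[:6]))
    print("shift applied after prompt:", np.allclose(steered[6:] - resid[6:], 2.0 * vec))
```

In this sketch, the sign and magnitude of `coeff` play the role of the positive or negative coefficient mentioned in the abstract; which layer to intervene on and how large the coefficient should be are choices the sketch leaves open.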
Related papers
- Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [34.05163996072159]
"Steering vectors" are extracted from the activations of human preference data.
This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization.
Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs.
arXiv Detail & Related papers (2024-05-28T05:10:40Z)
- Fusing Dictionary Learning and Support Vector Machines for Unsupervised Anomaly Detection [1.5999407512883508]
We introduce a new anomaly detection model that unifies the OC-SVM and DL residual functions into a single composite objective.
We extend both objectives to the more general setting that allows the use of kernel functions.
arXiv Detail & Related papers (2024-04-05T12:41:53Z)
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
- Activation Addition: Steering Language Models Without Optimization [40.04138190785384]
Activation engineering modifies activations at inference-time to predictably alter model behavior.
ActAdd takes far less compute and implementation effort than finetuning or RLHF.
Its computational overhead appears stable or improving over increasing model size.
arXiv Detail & Related papers (2023-08-20T12:21:05Z)
- Feature Separation and Recalibration for Adversarial Robustness [18.975320671203132]
We propose a novel, easy-to-verify approach named Feature Separation and Recalibration.
It recalibrates the malicious, non-robust activations for more robust feature maps through Separation and Recalibration.
It improves the robustness of existing adversarial training methods by up to 8.57% with small computational overhead.
arXiv Detail & Related papers (2023-03-24T07:43:57Z)
- Multi-modal Prompting for Low-Shot Temporal Action Localization [95.19505874963751]
We consider the problem of temporal action localization in the low-shot (zero-shot and few-shot) scenario.
We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification.
arXiv Detail & Related papers (2023-03-21T10:40:13Z)
- ContrastVAE: Contrastive Variational AutoEncoder for Sequential Recommendation [58.02630582309427]
We propose to incorporate contrastive learning into the framework of Variational AutoEncoders.
We introduce ContrastELBO, a novel training objective that extends the conventional single-view ELBO to the two-view case.
We also propose ContrastVAE, a two-branched VAE model with contrastive regularization as an embodiment of ContrastELBO for sequential recommendation.
arXiv Detail & Related papers (2022-08-27T03:35:00Z)
- AAVAE: Augmentation-Augmented Variational Autoencoders [43.73699420145321]
We introduce augmentation-augmented variational autoencoders (AAVAE), a third approach to self-supervised learning based on autoencoding.
We empirically evaluate the proposed AAVAE on image classification, similar to how recent contrastive and non-contrastive learning algorithms have been evaluated.
arXiv Detail & Related papers (2021-07-26T17:04:30Z)
- MCDAL: Maximum Classifier Discrepancy for Active Learning [74.73133545019877]
Recent state-of-the-art active learning methods have mostly leveraged Generative Adversarial Networks (GAN) for sample acquisition.
We propose a novel active learning framework that we call Maximum Classifier Discrepancy for Active Learning (MCDAL).
In particular, we utilize two auxiliary classification layers that learn tighter decision boundaries by maximizing the discrepancies among them.
arXiv Detail & Related papers (2021-07-23T06:57:08Z)
- Dual Adversarial Auto-Encoders for Clustering [152.84443014554745]
We propose Dual Adversarial Auto-encoder (Dual-AAE) for unsupervised clustering.
By performing variational inference on the objective function of Dual-AAE, we derive a new reconstruction loss which can be optimized by training a pair of Auto-encoders.
Experiments on four benchmarks show that Dual-AAE achieves superior performance over state-of-the-art clustering methods.
arXiv Detail & Related papers (2020-08-23T13:16:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.