Steering Llama 2 via Contrastive Activation Addition
- URL: http://arxiv.org/abs/2312.06681v4
- Date: Fri, 5 Jul 2024 15:30:45 GMT
- Title: Steering Llama 2 via Contrastive Activation Addition
- Authors: Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner
- Abstract summary: Contrastive Activation Addition (CAA) is a method for steering language models by modifying their activations during forward passes.
CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA's mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
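To make the recipe concrete, here is a minimal sketch of CAA in PyTorch with Hugging Face transformers. The model name, layer index, and contrast prompts are placeholders, and a faithful implementation would add the vector only at token positions after the user's prompt; this sketch applies it at every position the hooked layer processes.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any decoder-only LM works
LAYER = 13                               # illustrative; the paper sweeps layers
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def residual_at(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the final token after decoder layer LAYER."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1, :]  # hidden_states[0] is the embedding output

# Contrast pairs: identical prompts ending in a positive vs. negative answer token.
pairs = [("...question... Answer: (A)", "...question... Answer: (B)")]  # placeholders
steering_vec = torch.stack([residual_at(p) - residual_at(n) for p, n in pairs]).mean(0)

def hook(_module, _inputs, output):
    # Positive coefficient promotes the behavior, negative suppresses it.
    h = output[0] + 1.0 * steering_vec.to(output[0].dtype)
    return (h,) + output[1:]

# model.model.layers[i] assumes the Llama architecture in HF transformers.
handle = model.model.layers[LAYER].register_forward_hook(hook)
# out = model.generate(...)  # generation is now steered toward the behavior
handle.remove()
```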
Related papers
- Refusal in LLMs is an Affine Function (arXiv 2024-11-13)
We propose affine concept editing (ACE) as an approach for steering language models' behavior.
ACE combines affine subspace projection and activation addition to reliably control the model's refusal responses.
Our experiments demonstrate that ACE consistently achieves more precise control over model behavior than existing methods.
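As a sketch of the idea (not necessarily the paper's exact formulation): measure the activation's component along a refusal direction relative to a reference point, remove it, and pin it to a chosen value. All tensors below are illustrative.
```python
import torch

def ace_edit(h: torch.Tensor, r: torch.Tensor, mu: torch.Tensor, alpha: float) -> torch.Tensor:
    """Project out the existing component of (h - mu) along direction r,
    then add back a chosen amount alpha: alpha = 0 ablates the concept,
    larger alpha induces it."""
    r_hat = r / r.norm()
    coeff = torch.dot(h - mu, r_hat)          # current affine projection onto r
    return h - coeff * r_hat + alpha * r_hat  # pin the component to alpha

d = 4096  # illustrative hidden size
h, r, mu = torch.randn(d), torch.randn(d), torch.randn(d)
no_refusal = ace_edit(h, r, mu, alpha=0.0)
```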
- Improving Steering Vectors by Targeting Sparse Autoencoder Features (arXiv 2024-11-04)
We develop an improved steering method, SAE-Targeted Steering (SAE-TS), which finds steering vectors to target specific SAE features while minimizing unintended side effects.
We show that SAE-TS balances steering effects with coherence better than CAA and SAE feature steering, when evaluated on a range of tasks.
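A hedged sketch of the targeting idea: choose a steering vector that drives a selected SAE feature toward a target activation while penalizing the vector's norm to limit side effects. The toy random SAE stands in for a trained one, and the paper's actual estimator (a learned linear map from steering vectors to feature effects) is replaced here by direct gradient descent.
```python
import torch

d_model, d_sae = 512, 4096                           # illustrative sizes
W_enc = torch.randn(d_model, d_sae) / d_model**0.5   # stand-in for a trained SAE

def sae_features(h):
    return torch.relu(h @ W_enc)

target = torch.zeros(d_sae)
target[123] = 5.0                                # hypothetical feature to activate

h = torch.randn(d_model)                         # sample residual-stream activation
v = torch.zeros(d_model, requires_grad=True)     # the steering vector to learn
opt = torch.optim.Adam([v], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    # Hit the target feature; the norm penalty discourages off-target effects.
    loss = (sae_features(h + v) - target).pow(2).sum() + 0.1 * v.pow(2).sum()
    loss.backward()
    opt.step()
```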
- Controlling Language and Diffusion Models by Transporting Activations (arXiv 2024-10-30)
We introduce Activation Transport (AcT), a framework to steer activations guided by optimal transport theory.
We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is).
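A sketch of activation transport under strong simplifying assumptions: if each activation coordinate is treated as an independent 1-D Gaussian, the optimal transport map from source to target distribution is affine, and a strength parameter interpolates between the two. All tensors below are toy data.
```python
import torch

def linear_ot_map(src: torch.Tensor, tgt: torch.Tensor):
    """Per-coordinate affine transport: h -> (h - mu_s) * (s_t / s_s) + mu_t,
    the closed-form OT map between 1-D Gaussians, estimated from samples."""
    mu_s, s_s = src.mean(0), src.std(0) + 1e-6
    mu_t, s_t = tgt.mean(0), tgt.std(0) + 1e-6
    def transport(h, strength=1.0):
        mapped = (h - mu_s) * (s_t / s_s) + mu_t
        return h + strength * (mapped - h)   # 0 = untouched, 1 = fully transported
    return transport

src = torch.randn(1000, 512)                 # activations on source-behavior prompts
tgt = torch.randn(1000, 512) * 1.3 + 0.2     # activations on target-behavior prompts
transport = linear_ot_map(src, tgt)
steered = transport(torch.randn(8, 512), strength=0.8)
```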
arXiv Detail & Related papers (2024-10-30T14:21:33Z) - Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors [8.761404991620285]
Activation intervention has emerged as an effective and economical method to modify the behavior of large language models (LLMs).
We propose Semantics-Adaptive Dynamic Intervention (SADI), a novel method that constructs a dynamic steering vector to intervene model activations at inference time.
Experimental results show that SADI outperforms established baselines by substantial margins, improving task performance without training.
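A rough sketch of what "dynamic" means here, with the specifics assumed rather than taken from the paper: a mask of behavior-relevant dimensions is identified from contrastive activation differences, and the applied offset is scaled by the current input's own activations, so the intervention varies per input instead of being one fixed vector.
```python
import torch

def sadi_style_intervene(h, diff_mean, k=128, scale=1.0):
    """Steer only the top-k dimensions ranked by contrastive difference,
    with magnitudes taken from the input activation h itself."""
    idx = diff_mean.abs().topk(k).indices
    mask = torch.zeros_like(diff_mean)
    mask[idx] = diff_mean.sign()[idx]        # direction per selected dimension
    return h + scale * mask * h.abs()        # dynamic: strength follows h

d = 512                                      # illustrative hidden size
diff_mean = torch.randn(d)                   # mean (positive - negative) difference
steered = sadi_style_intervene(torch.randn(d), diff_mean)
```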
- Improving Instruction-Following in Language Models through Activation Steering (arXiv 2024-10-15)
We derive instruction-specific vector representations from language models and use them to steer models accordingly.
We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion.
Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
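This is close in spirit to CAA; here is a minimal sketch under the assumption that the vector is the mean activation difference between prompts with and without the instruction.
```python
import torch

def instruction_vector(acts_with: torch.Tensor, acts_without: torch.Tensor) -> torch.Tensor:
    """Mean activation difference (with-instruction minus without), collected
    at a fixed layer and token position over many prompt pairs."""
    return (acts_with - acts_without).mean(dim=0)

# Illustrative pre-collected activations, shape (num_pairs, d_model).
v = instruction_vector(torch.randn(64, 512), torch.randn(64, 512))
# Adding +v during generation (via a layer hook as in the CAA sketch above)
# nudges the model to follow the instruction even when it is not in the prompt.
```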
- Activation Scaling for Steering and Interpreting Language Models (arXiv 2024-10-07)
We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings.
We establish a three-term objective: a successful intervention should flip the correct token with the wrong one, and vice versa.
Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention.
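A toy sketch of the gradient-based recipe, with a frozen linear map standing in for the model: a single learnable scalar rescales one activation, trained so the correct token overtakes the wrong one while a penalty keeps the edit minimal. All components here are stand-ins, not the paper's exact three-term objective.
```python
import torch

torch.manual_seed(0)
d, vocab = 64, 10
W = torch.randn(vocab, d)                 # frozen stand-in for the unembedding
h = torch.randn(d)                        # the activation we may rescale
correct, wrong = 3, 7                     # token ids, illustrative

alpha = torch.tensor(1.0, requires_grad=True)   # learnable scaling factor
opt = torch.optim.Adam([alpha], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    logits = W @ (alpha * h)
    # Flip term: prefer `correct` over `wrong`; the second term keeps alpha
    # near 1 so the intervention stays small and interpretable.
    loss = -(logits[correct] - logits[wrong]) + 0.1 * (alpha - 1.0).pow(2)
    loss.backward()
    opt.step()
```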
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance (arXiv 2024-01-20)
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
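A conceptual sketch of cross-model guidance, with every tensor assumed: a linear harmfulness probe (extracted from a safety-aligned guidance model) scores the prompt's activation, and only when it fires is a safety steering vector added, leaving benign queries untouched.
```python
import torch

def guarded_activation(h, probe_w, probe_b, safety_vec, threshold=0.0):
    """Add the safety vector only when the harmfulness probe fires."""
    score = h @ probe_w + probe_b            # linear probe from the guidance model
    return h + safety_vec if score > threshold else h

d = 512                                       # illustrative hidden size
h, probe_w, safety_vec = torch.randn(d), torch.randn(d), torch.randn(d)
guarded = guarded_activation(h, probe_w, torch.tensor(0.0), safety_vec)
```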
- Steering Language Models With Activation Engineering (arXiv 2023-08-20)
We introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs.
We achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT.
ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks.
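For contrast with CAA's dataset-averaged vectors, a minimal ActAdd-style sketch: the steering vector comes from a single natural-language prompt pair (e.g. "Love" vs. "Hate") at one layer, scaled by a user-chosen coefficient. Shapes and values below are illustrative.
```python
import torch

def actadd_vector(act_plus: torch.Tensor, act_minus: torch.Tensor, coeff: float) -> torch.Tensor:
    """Steering vector from one contrast pair, scaled by a coefficient."""
    return coeff * (act_plus - act_minus)

# Per-position activations of the two contrast prompts, shape (seq, d_model).
v = actadd_vector(torch.randn(3, 512), torch.randn(3, 512), coeff=8.0)
# v is added to the first `seq` positions of the user's prompt during the
# forward pass, again via a layer hook as in the CAA sketch above.
```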
- ContrastVAE: Contrastive Variational AutoEncoder for Sequential Recommendation (arXiv 2022-08-27)
We propose to incorporate contrastive learning into the framework of Variational AutoEncoders.
We introduce ContrastELBO, a novel training objective that extends the conventional single-view ELBO to two-view case.
We also propose ContrastVAE, a two-branched VAE model with contrastive regularization as an embodiment of ContrastELBO for sequential recommendation.
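A hedged sketch of the two-view objective: one ELBO (reconstruction plus KL) per augmented view, plus an InfoNCE-style term pulling matched latents together as a stand-in for ContrastELBO's contrastive piece. The shapes and the specific loss choices are assumptions.
```python
import torch
import torch.nn.functional as F

def two_view_loss(recon1, recon2, targets, mu1, lv1, mu2, lv2, z1, z2, tau=0.5):
    """ELBO per view + contrastive alignment of the two views' latents."""
    def elbo(recon, mu, lv):
        rec = F.cross_entropy(recon, targets)                 # next-item prediction
        kl = -0.5 * torch.mean(1 + lv - mu.pow(2) - lv.exp())
        return rec + kl
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / tau
    nce = F.cross_entropy(sim, torch.arange(z1.size(0)))      # matched pairs = positives
    return elbo(recon1, mu1, lv1) + elbo(recon2, mu2, lv2) + nce

B, V, d = 32, 1000, 64   # batch, item vocabulary, latent size (illustrative)
loss = two_view_loss(torch.randn(B, V), torch.randn(B, V),
                     torch.randint(0, V, (B,)),
                     torch.randn(B, d), torch.randn(B, d),
                     torch.randn(B, d), torch.randn(B, d),
                     torch.randn(B, d), torch.randn(B, d))
```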
- MCDAL: Maximum Classifier Discrepancy for Active Learning (arXiv 2021-07-23)
Recent state-of-the-art active learning methods have mostly leveraged Generative Adversarial Networks (GAN) for sample acquisition.
In this paper, we propose a novel active learning framework called Maximum Classifier Discrepancy for Active Learning (MCDAL).
In particular, we utilize two auxiliary classification layers that learn tighter decision boundaries by maximizing the discrepancies among them.
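A sketch of the acquisition step, assuming the two auxiliary heads have already been trained to maximize their disagreement: unlabeled samples on which the heads disagree most are the ones queried for labels.
```python
import torch
import torch.nn.functional as F

def discrepancy_scores(feats, head1, head2):
    """L1 disagreement between the two auxiliary classifier heads, per sample."""
    p1 = F.softmax(head1(feats), dim=-1)
    p2 = F.softmax(head2(feats), dim=-1)
    return (p1 - p2).abs().sum(dim=-1)

feats = torch.randn(100, 256)                       # backbone features (illustrative)
head1, head2 = torch.nn.Linear(256, 10), torch.nn.Linear(256, 10)
query_idx = discrepancy_scores(feats, head1, head2).topk(8).indices  # samples to label
```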