Steering Llama 2 via Contrastive Activation Addition
- URL: http://arxiv.org/abs/2312.06681v4
- Date: Fri, 5 Jul 2024 15:30:45 GMT
- Title: Steering Llama 2 via Contrastive Activation Addition
- Authors: Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner
- Abstract summary: Contrastive Activation Addition (CAA) is a method for steering language models by modifying their activations during forward passes.
CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA's mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
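As a rough illustration of the procedure described in the abstract, the sketch below computes a steering vector as the mean residual-stream activation difference over contrastive example pairs and adds it back during generation via a forward hook. It assumes a Hugging Face Llama-style chat model; the model name, layer index, and coefficient are illustrative choices rather than the paper's exact configuration, and for simplicity the hook adds the vector at every position instead of only after the user's prompt.

```python
# Minimal sketch of Contrastive Activation Addition (CAA).
# Assumptions: a Hugging Face Llama-style model; LAYER and the coefficient are
# illustrative; the hook adds the vector at every position, whereas CAA applies
# it only from the end of the user's prompt onward.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 13  # illustrative mid-network decoder layer


@torch.no_grad()
def residual_at_last_token(text: str) -> torch.Tensor:
    """Residual-stream activation after decoder layer LAYER at the final token."""
    ids = tok(text, return_tensors="pt").to(model.device)
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1, :]  # hidden_states[0] is the embedding output


def steering_vector(pairs):
    """Mean activation difference over (positive, negative) example pairs."""
    diffs = [residual_at_last_token(pos) - residual_at_last_token(neg) for pos, neg in pairs]
    return torch.stack(diffs).mean(dim=0)


def generate_with_caa(prompt: str, vec: torch.Tensor, coeff: float) -> str:
    """Add coeff * vec to the output of decoder layer LAYER while generating."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * vec.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.model.layers[LAYER].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=64)
    finally:
        handle.remove()
    return tok.decode(out[0], skip_special_tokens=True)
```

A positive coefficient amplifies the contrasted behavior and a negative one suppresses it, e.g. `generate_with_caa(prompt, vec, -1.0)`.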
Related papers
- Controlling Language and Diffusion Models by Transporting Activations [23.352500740697938]
We introduce Activation Transport (AcT), a framework to steer activations guided by optimal transport theory.
We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is).
arXiv Detail & Related papers (2024-10-30T14:21:33Z)
- Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors [8.761404991620285]
Activation intervention has emerged as an effective and economical method to modify the behavior of large language models (LLMs).
We propose Semantics-Adaptive Dynamic Intervention (SADI), a novel method that constructs a dynamic steering vector to intervene model activations at inference time.
Experimental results show that SADI outperforms established baselines by substantial margins, improving task performance without training.
arXiv Detail & Related papers (2024-10-16T06:58:49Z)
- Improving Instruction-Following in Language Models through Activation Steering [58.876600545898675]
We derive instruction-specific vector representations from language models and use them to steer models accordingly.
We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion.
Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
arXiv Detail & Related papers (2024-10-15T08:38:20Z)
- Activation Scaling for Steering and Interpreting Language Models [55.59689963561315]
We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings.
We establish a three-term objective: a successful intervention should flip the model's prediction from the correct token to the wrong one, and vice versa.
Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention.
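One hypothetical toy reading of such gradient-based optimization, not the paper's actual three-term objective: learn a single scalar that rescales one layer's activations until the model's preference between a correct and a wrong token flips.

```python
# Hypothetical toy of a learned activation-scaling intervention; the single
# scalar and the simple flipping loss are simplifications of the paper's
# three-term objective, not its actual formulation.
import torch


def learn_activation_scale(model, layer_module, prompt_ids, correct_id, wrong_id,
                           steps=100, lr=1e-2):
    for p in model.parameters():
        p.requires_grad_(False)  # only the scaling factor is trained

    device = next(model.parameters()).device
    alpha = torch.nn.Parameter(torch.ones((), device=device))
    opt = torch.optim.Adam([alpha], lr=lr)

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = alpha * hidden  # rescale this layer's output by the learned scalar
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = layer_module.register_forward_hook(hook)
    try:
        for _ in range(steps):
            logits = model(prompt_ids).logits[0, -1]
            # "flipping" objective: push the wrong token's logit above the correct one's
            loss = logits[correct_id] - logits[wrong_id]
            opt.zero_grad()
            loss.backward()
            opt.step()
    finally:
        handle.remove()
    return alpha.item()
```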
arXiv Detail & Related papers (2024-10-07T12:01:32Z)
- Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [34.05163996072159]
"steering vectors" are extracted from the activations of human preference data.
This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization.
Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs.
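A hedged sketch of how such a vector might be optimized directly against contrastive preference pairs follows. The bidirectional loss (adding +v should favor the chosen response, adding -v the rejected one) and the `add_vector` context manager passed in by the caller are assumptions based on this summary, not the paper's exact recipe.

```python
# Hedged sketch of optimizing a steering vector against contrastive preference
# pairs. `add_vector` is assumed to be a context manager that injects a vector
# into the residual stream (e.g. via a forward hook as in the CAA sketch above).
import torch
import torch.nn.functional as F


def sequence_logprob(model, prompt_ids, target_ids):
    """Total log-probability the model assigns to target_ids given prompt_ids."""
    logits = model(torch.cat([prompt_ids, target_ids], dim=1)).logits
    span = logits[:, prompt_ids.size(1) - 1 : -1, :]  # positions predicting the target
    logp = F.log_softmax(span, dim=-1)
    return logp.gather(-1, target_ids.unsqueeze(-1)).sum()


def train_steering_vector(model, add_vector, pairs, hidden_size, steps=200, lr=5e-3):
    """pairs: list of (prompt_ids, chosen_ids, rejected_ids) token-id tensors."""
    for p in model.parameters():
        p.requires_grad_(False)  # only the steering vector is trained
    device = next(model.parameters()).device
    v = torch.nn.Parameter(torch.zeros(hidden_size, device=device))
    opt = torch.optim.Adam([v], lr=lr)
    for _ in range(steps):
        prompt, chosen, rejected = pairs[torch.randint(len(pairs), ()).item()]
        loss = 0.0
        # +v should make the chosen response more likely, -v the rejected one
        for sign, preferred, dispreferred in [(1.0, chosen, rejected), (-1.0, rejected, chosen)]:
            with add_vector(sign * v):
                margin = (sequence_logprob(model, prompt, preferred)
                          - sequence_logprob(model, prompt, dispreferred))
            loss = loss - F.logsigmoid(margin)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return v.detach()
```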
arXiv Detail & Related papers (2024-05-28T05:10:40Z)
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
- Steering Language Models With Activation Engineering [40.04138190785384]
We introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs.
We achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT.
ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks.
arXiv Detail & Related papers (2023-08-20T12:21:05Z)
- ContrastVAE: Contrastive Variational AutoEncoder for Sequential Recommendation [58.02630582309427]
We propose to incorporate contrastive learning into the framework of Variational AutoEncoders.
We introduce ContrastELBO, a novel training objective that extends the conventional single-view ELBO to the two-view case.
We also propose ContrastVAE, a two-branched VAE model with contrastive regularization as an embodiment of ContrastELBO for sequential recommendation.
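A rough sketch of what a two-view ELBO with a contrastive regularizer could look like is given below; the encoder/decoder interfaces, the InfoNCE term, and the weighting `lam` are assumptions inferred from this summary, not the paper's exact objective.

```python
# Illustrative sketch of a two-view ELBO with a contrastive regularizer, in the
# spirit of ContrastELBO; interfaces and weights are assumptions.
import torch
import torch.nn.functional as F


def elbo_terms(encoder, decoder, seq, target):
    mu, logvar = encoder(seq)                                  # encode one view of the sequence
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterization trick
    recon = F.cross_entropy(decoder(z), target)                # next-item reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return z, recon + kl


def info_nce(z1, z2, temperature=0.1):
    """Contrastive term: latents of matched views are positives, others negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)


def contrast_elbo(encoder, decoder, view1, view2, target, lam=0.5):
    """Two single-view ELBOs plus a contrastive regularizer between their latents."""
    z1, elbo1 = elbo_terms(encoder, decoder, view1, target)
    z2, elbo2 = elbo_terms(encoder, decoder, view2, target)
    return elbo1 + elbo2 + lam * info_nce(z1, z2)
```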
arXiv Detail & Related papers (2022-08-27T03:35:00Z)
- MCDAL: Maximum Classifier Discrepancy for Active Learning [74.73133545019877]
Recent state-of-the-art active learning methods have mostly leveraged Generative Adversarial Networks (GAN) for sample acquisition.
We propose in this paper a novel active learning framework that we call Maximum Classifier Discrepancy for Active Learning (MCDAL).
In particular, we utilize two auxiliary classification layers that learn tighter decision boundaries by maximizing the discrepancies among them.
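The sketch below illustrates the acquisition side of this idea: two auxiliary classifier heads share a backbone, and the unlabeled samples on which their predictions disagree most are selected for labeling. The L1 discrepancy measure and the module layout are simplifying assumptions; the discrepancy-maximization training loop is only hinted at in the comments.

```python
# Hedged sketch of discrepancy-based sample acquisition with two auxiliary
# classifier heads. Training would additionally maximize the heads' disagreement
# (on top of the usual classification loss) to tighten their decision boundaries.
import torch
import torch.nn.functional as F


class TwoHeadClassifier(torch.nn.Module):
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        self.main = torch.nn.Linear(feat_dim, num_classes)
        self.aux1 = torch.nn.Linear(feat_dim, num_classes)  # auxiliary classifier 1
        self.aux2 = torch.nn.Linear(feat_dim, num_classes)  # auxiliary classifier 2

    def forward(self, x):
        feats = self.backbone(x)
        return self.main(feats), self.aux1(feats), self.aux2(feats)


def discrepancy(logits1, logits2):
    """L1 distance between the two auxiliary heads' predictive distributions."""
    return (F.softmax(logits1, dim=-1) - F.softmax(logits2, dim=-1)).abs().sum(dim=-1)


@torch.no_grad()
def select_for_labeling(model, unlabeled_loader, budget):
    """Pick the unlabeled samples on which the auxiliary heads disagree most."""
    scores, indices = [], []
    for idx, x in unlabeled_loader:      # loader assumed to yield (index, input) pairs
        _, a1, a2 = model(x)
        scores.append(discrepancy(a1, a2))
        indices.append(idx)
    scores, indices = torch.cat(scores), torch.cat(indices)
    return indices[scores.topk(budget).indices]
```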
arXiv Detail & Related papers (2021-07-23T06:57:08Z)