Steering Language Models With Activation Engineering
- URL: http://arxiv.org/abs/2308.10248v5
- Date: Thu, 10 Oct 2024 13:20:13 GMT
- Title: Steering Language Models With Activation Engineering
- Authors: Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid,
- Abstract summary: We introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs.
We achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT.
ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks.
- Score: 40.04138190785384
- License:
- Abstract: Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
Related papers
- Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors [8.761404991620285]
Activation intervention has emerged as an effective and economical method to modify the behavior of large language models (LLMs)
We propose Semantics-Adaptive Dynamic Intervention (SADI), a novel method that constructs a dynamic steering vector to intervene model activations at inference time.
Experimental results show that SADI outperforms established baselines by substantial margins, improving task performance without training.
arXiv Detail & Related papers (2024-10-16T06:58:49Z) - Improving Instruction-Following in Language Models through Activation Steering [58.876600545898675]
We derive instruction-specific vector representations from language models and use them to steer models accordingly.
We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion.
Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
arXiv Detail & Related papers (2024-10-15T08:38:20Z) - Activation Scaling for Steering and Interpreting Language Models [55.59689963561315]
We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings.
We establish a three-term objective: a successful intervention should flip the correct with the wrong token and vice versa.
Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention.
arXiv Detail & Related papers (2024-10-07T12:01:32Z) - Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment.
We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits.
Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z) - InferAligner: Inference-Time Alignment for Harmlessness through
Cross-Model Guidance [56.184255657175335]
We develop textbfInferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z) - Steering Llama 2 via Contrastive Activation Addition [41.54815073311959]
Contrastive Activation Addition (CAA) is a method for steering language models by modifying their activations during forward passes.
CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs)
arXiv Detail & Related papers (2023-12-09T04:40:46Z) - Improving Activation Steering in Language Models with Mean-Centring [10.101141087916133]
We find that taking the average of activations associated with a target dataset, and subtracting the mean of all training activations, results in effective steering vectors.
We also apply mean-centring to extract function vectors, more effectively triggering the execution of a range of natural language tasks by a significant margin.
arXiv Detail & Related papers (2023-12-06T18:27:07Z) - Swim: A General-Purpose, High-Performing, and Efficient Activation
Function for Locomotion Control Tasks [0.2538209532048866]
Activation functions play a significant role in the performance of deep learning algorithms.
In particular, the Swish activation function tends to outperform ReLU on deeper models.
We propose Swim, a general-purpose, efficient, and high-performing alternative to Swish.
arXiv Detail & Related papers (2023-03-05T11:04:33Z) - Transformers with Learnable Activation Functions [63.98696070245065]
We use Rational Activation Function (RAF) to learn optimal activation functions during training according to input data.
RAF opens a new research direction for analyzing and interpreting pre-trained models according to the learned activation functions.
arXiv Detail & Related papers (2022-08-30T09:47:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.