Related papers: Activation Addition: Steering Language Models Without Optimization

URL: http://arxiv.org/abs/2308.10248v4
Date: Tue, 4 Jun 2024 10:08:39 GMT
Title: Activation Addition: Steering Language Models Without Optimization
Authors: Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid
Abstract summary: Activation engineering modifies activations at inference-time to predictably alter model behavior. ActAdd takes far less compute and implementation effort than finetuning or RLHF. Its computational overhead appears stable or improving over increasing model size.
Score: 40.04138190785384
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reliably controlling the behavior of large language models is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback, prompt engineering and guided decoding. We instead investigate activation engineering: modifying activations at inference-time to predictably alter model behavior. We bias the forward pass with a 'steering vector' implicitly specified through natural language. Past work learned these steering vectors; our Activation Addition (ActAdd) method instead computes them by taking activation differences resulting from pairs of prompts. We demonstrate ActAdd on a range of LLMs (LLaMA-3, OPT, GPT-2, and GPT-J), obtaining SOTA on detoxification and negative-to-positive sentiment control. Our approach yields inference-time control over high-level properties of output like topic and sentiment while preserving performance on off-target tasks. ActAdd takes far less compute and implementation effort than finetuning or RLHF, allows users control through natural language, and its computational overhead (as a fraction of inference time) appears stable or improving over increasing model size.
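The core idea in the abstract, a steering vector computed as the difference of activations from a pair of prompts and then added during the forward pass, can be illustrated with a toy sketch. This is not the authors' implementation: `toy_layer_activations` is a hypothetical stand-in for reading one layer's activations from a real model, the coefficient `coeff` and the choice to add the difference at the leading token positions are assumptions made for illustration, and the prompt pair is arbitrary.

```python
from typing import List

Vector = List[float]
Activations = List[Vector]  # one vector per token position


def toy_layer_activations(prompt: str, dim: int = 4) -> Activations:
    """Hypothetical stand-in for a model's activations at one chosen layer."""
    acts = []
    for token in prompt.split():
        # Deterministic pseudo-embedding from character codes (illustration only).
        code = sum(map(ord, token))
        acts.append([float((code * (i + 1)) % 7) for i in range(dim)])
    return acts


def steering_vector(pos: Activations, neg: Activations, coeff: float) -> Activations:
    """Return coeff * (pos - neg), position by position."""
    if len(pos) != len(neg):
        raise ValueError("pad the prompts so they have the same number of tokens")
    return [[coeff * (p - n) for p, n in zip(pv, nv)] for pv, nv in zip(pos, neg)]


def add_steering(acts: Activations, steer: Activations) -> Activations:
    """Add the steering activations at the leading positions; leave the rest unchanged."""
    out = [list(v) for v in acts]
    for t in range(min(len(steer), len(out))):
        out[t] = [a + s for a, s in zip(out[t], steer[t])]
    return out


# Assumed prompt pair and coefficient, chosen only for this example.
h_pos = toy_layer_activations("Love")
h_neg = toy_layer_activations("Hate")
steer = steering_vector(h_pos, h_neg, coeff=5.0)

user = toy_layer_activations("I went to the store")
steered = add_steering(user, steer)
```

In a real model the modified activations would then be passed on to the remaining layers; no weights are updated, which is why the abstract contrasts the approach with finetuning and RLHF.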

Related papers

GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
Scaling laws for activation steering with Llama 2 models and refusal mechanisms [0.13194391758295113]
CAA works by finding desirable 'directions' in the model's residual stream vector space using contrastive pairs. This paper explores the effectiveness of CAA with model scale using the family of Llama 2 models (7B, 13B, and 70B).
arXiv Detail & Related papers (2025-07-15T22:21:18Z)
Detecting Informative Channels: ActionFormer [3.1976901430982063]
ActionFormer gives us additional outputs which detect the borders of the activities as well as the activity labels. We analyze this extensively in terms of deep learning architectures. Our method achieves a substantial improvement of 16.01% in average mAP for inertial data.
arXiv Detail & Related papers (2025-05-27T05:29:02Z)
Joint Localization and Activation Editing for Low-Resource Fine-Tuning [73.64004083269424]
We propose a joint localization and activation editing (JoLA) method. JoLA learns (1) which heads in the Transformer to edit, (2) whether the intervention should be additive, multiplicative, or both, and (3) the intervention parameters themselves. We demonstrate that JoLA consistently outperforms existing methods.
arXiv Detail & Related papers (2025-02-03T09:13:09Z)
Interpretable Steering of Large Language Models with Feature Guided Activation Additions [4.496738719682736]
We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method. By operating in the latent space of a Sparse Autoencoder (SAE), FGAA constructs precise steering vectors. Evaluations on Gemma-2-2B and Gemma-2-9B models demonstrate that FGAA outperforms existing steering methods.
arXiv Detail & Related papers (2025-01-17T02:55:23Z)
Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors [8.761404991620285]
Activation intervention has emerged as an effective and economical method to modify the behavior of large language models (LLMs). We propose Semantics-Adaptive Dynamic Intervention (SADI), a novel method that constructs a dynamic steering vector to intervene on model activations at inference time. Experimental results show that SADI outperforms established baselines by substantial margins, improving task performance without training.
arXiv Detail & Related papers (2024-10-16T06:58:49Z)
Improving Instruction-Following in Language Models through Activation Steering [58.876600545898675]
We derive instruction-specific vector representations from language models and use them to steer models accordingly. We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion. Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
arXiv Detail & Related papers (2024-10-15T08:38:20Z)
Activation Scaling for Steering and Interpreting Language Models [55.59689963561315]
We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings. We establish a three-term objective: a successful intervention should flip the correct with the wrong token and vice versa. Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention.
arXiv Detail & Related papers (2024-10-07T12:01:32Z)
Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits. Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z)
InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment. Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics. It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
Steering Llama 2 via Contrastive Activation Addition [41.54815073311959]
Contrastive Activation Addition (CAA) is a method for steering language models by modifying their activations during forward passes. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
arXiv Detail & Related papers (2023-12-09T04:40:46Z)
Improving Activation Steering in Language Models with Mean-Centring [10.101141087916133]
We find that taking the average of activations associated with a target dataset, and subtracting the mean of all training activations, results in effective steering vectors. We also apply mean-centring to extract function vectors, more effectively triggering the execution of a range of natural language tasks by a significant margin.
arXiv Detail & Related papers (2023-12-06T18:27:07Z)
Swim: A General-Purpose, High-Performing, and Efficient Activation Function for Locomotion Control Tasks [0.2538209532048866]
Activation functions play a significant role in the performance of deep learning algorithms. In particular, the Swish activation function tends to outperform ReLU on deeper models. We propose Swim, a general-purpose, efficient, and high-performing alternative to Swish.
arXiv Detail & Related papers (2023-03-05T11:04:33Z)
Transformers with Learnable Activation Functions [63.98696070245065]
We use Rational Activation Function (RAF) to learn optimal activation functions during training according to input data. RAF opens a new research direction for analyzing and interpreting pre-trained models according to the learned activation functions.
arXiv Detail & Related papers (2022-08-30T09:47:31Z)
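
The mean-centring entry in the list above describes a simple piece of arithmetic: average the activations of a target dataset, then subtract the mean of all training activations. A minimal sketch of that calculation follows, with made-up two-dimensional activations and hypothetical function names; it illustrates the formula only, not that paper's code.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_centred_steering(target: List[Vector], training: List[Vector]) -> Vector:
    """Mean of the target activations minus the mean of all training activations."""
    mu_target = mean_vector(target)
    mu_all = mean_vector(training)
    return [t - a for t, a in zip(mu_target, mu_all)]


# Made-up activations, for illustration only.
target_acts = [[2.0, 0.0], [4.0, 2.0]]
training_acts = [[1.0, 1.0], [1.0, 3.0], [1.0, -1.0], [1.0, 1.0]]

vec = mean_centred_steering(target_acts, training_acts)
print(vec)  # [2.0, 0.0]
```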

This list is automatically generated from the titles and abstracts of the papers on this site.