Scaling laws for activation steering with Llama 2 models and refusal mechanisms
- URL: http://arxiv.org/abs/2507.11771v1
- Date: Tue, 15 Jul 2025 22:21:18 GMT
- Title: Scaling laws for activation steering with Llama 2 models and refusal mechanisms
- Authors: Sheikh Abdur Raheem Ali, Justin Xu, Ivory Yang, Jasmine Xinze Li, Ayse Arslan, Clark Benham,
- Abstract summary: CAA works by finding desirable 'directions' in the model's residual stream vector space using contrastive pairs. This paper explores the effectiveness of CAA with model scale using the family of Llama 2 models (7B, 13B, and 70B).
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) evolve in complexity and capability, the efficacy of less widely deployed alignment techniques is uncertain. Building on previous work on activation steering and contrastive activation addition (CAA), this paper explores the effectiveness of CAA with model scale using the family of Llama 2 models (7B, 13B, and 70B). CAA works by finding desirable 'directions' in the model's residual stream vector space using contrastive pairs (for example, hate vs. love) and adding this direction to the residual stream during the forward pass. It directly manipulates the residual stream and aims to extract features from language models to better control their outputs. Using answer-matching questions centered on refusal behavior, we found that 1) CAA is most effective when applied at early-mid layers, 2) the effectiveness of CAA diminishes with model size, and 3) negative steering has more pronounced effects than positive steering across all model sizes.
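The CAA mechanism described in the abstract can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function names (`contrastive_direction`, `apply_caa`) are hypothetical, and a real implementation would hook into a transformer's forward pass (e.g. via PyTorch hooks) rather than operate on plain arrays.

```python
import numpy as np

def contrastive_direction(pos_acts, neg_acts):
    """Steering vector: mean difference of residual-stream activations
    collected from contrastive prompt pairs (e.g. 'love' vs. 'hate').
    Both inputs have shape (num_pairs, d_model)."""
    return np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)

def apply_caa(resid, direction, alpha):
    """Add the scaled direction to the residual stream at a chosen layer.
    alpha > 0 steers toward the behavior; alpha < 0 steers away
    (the paper's 'negative steering')."""
    return resid + alpha * direction
```

Per the paper's findings, one would apply this at an early-mid layer and expect the |alpha| needed for a given effect to grow with model size.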
Related papers
- GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z) - Normalized Attention Guidance: Universal Negative Guidance for Diffusion Models [57.20761595019967]
We present Normalized Attention Guidance (NAG), an efficient, training-free mechanism that applies extrapolation in attention space with L1-based normalization and refinement. NAG restores effective negative guidance where CFG collapses while maintaining fidelity. NAG generalizes across architectures (UNet, DiT), sampling regimes (few-step, multi-step), and modalities (image, video).
arXiv Detail & Related papers (2025-05-27T13:30:46Z) - Improved Representation Steering for Language Models [50.86411958644953]
We show how to improve representation steering via our new Reference-free Preference Steering (RePS). On Gemma models with sizes ranging from 2B to 27B, RePS outperforms all existing steering methods trained with a language modeling objective. In suppression, RePS matches the language-modeling objective on Gemma-2 and outperforms it on the larger Gemma-3 variants.
arXiv Detail & Related papers (2025-05-27T07:16:40Z) - Patterns and Mechanisms of Contrastive Activation Engineering [0.374490703387131]
CAE has the potential to introduce a new paradigm of flexible, task-specific behavior tuning. We analyze the performance of CAE in in-distribution and out-of-distribution settings, evaluate drawbacks, and begin to develop comprehensive guidelines for its effective deployment.
arXiv Detail & Related papers (2025-05-06T05:15:12Z) - Interpretable Steering of Large Language Models with Feature Guided Activation Additions [4.496738719682736]
We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method. By operating in the latent space of a Sparse Autoencoder (SAE), FGAA constructs precise steering vectors. Evaluations on Gemma-2-2B and Gemma-2-9B models demonstrate that FGAA outperforms existing steering methods.
arXiv Detail & Related papers (2025-01-17T02:55:23Z) - Refusal in LLMs is an Affine Function [1.722461331472526]
We propose affine concept editing (ACE) as an approach for steering language models' behavior. ACE combines affine subspace projection and activation addition to reliably control the model's refusal responses. Our experiments demonstrate that ACE consistently achieves more precise control over model behavior than existing methods.
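The "affine subspace projection plus activation addition" recipe in the ACE blurb can be sketched as follows. This is a simplified interpretation, not the paper's exact formulation: the function name `ace_edit` and the single-direction treatment of refusal are assumptions made here for illustration.

```python
import numpy as np

def ace_edit(h, refusal_dir, target):
    """Sketch of affine concept editing: measure the activation's
    coordinate along a (unit-normalized) refusal direction, then
    shift it to a chosen target value via activation addition."""
    r_hat = refusal_dir / np.linalg.norm(refusal_dir)
    coord = np.dot(h, r_hat)            # current position along the concept axis
    return h + (target - coord) * r_hat  # set that position to `target`
```

Setting `target = 0` projects the refusal component out entirely; nonzero targets dial refusal up or down.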
arXiv Detail & Related papers (2024-11-13T20:12:55Z) - Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment.
We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits.
Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z) - InferAligner: Inference-Time Alignment for Harmlessness through
Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z) - Steering Llama 2 via Contrastive Activation Addition [41.54815073311959]
Contrastive Activation Addition (CAA) is a method for steering language models by modifying their activations during forward passes.
CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
arXiv Detail & Related papers (2023-12-09T04:40:46Z) - Steering Language Models With Activation Engineering [40.04138190785384]
We introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs.
We achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT.
ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks.
arXiv Detail & Related papers (2023-08-20T12:21:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.