Related papers: Fine-Grained Activation Steering: Steering Less, Achieving More

Fine-Grained Activation Steering: Steering Less, Achieving More

URL: http://arxiv.org/abs/2602.04428v1
Date: Wed, 04 Feb 2026 11:03:01 GMT
Title: Fine-Grained Activation Steering: Steering Less, Achieving More
Authors: Zijian Feng, Tianjiao Li, Zixiao Zhu, Hanzhang Zhou, Junlang Qian, Li Zhang, Jia Jim Deryl Chua, Lee Onn Mak, Gee Wah Ng, Kezhi Mao,
Abstract summary: Activation steering has emerged as a cost-effective paradigm for modifying large language model (LLM) behaviors.<n>We show that block-level activations are inherently heterogeneous, entangling beneficial, irrelevant, and harmful features.<n>We propose AUSteer, a simple and efficient method that operates at a finer granularity of the AU level.
Score: 33.680685349571
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Activation steering has emerged as a cost-effective paradigm for modifying large language model (LLM) behaviors. Existing methods typically intervene at the block level, steering the bundled activations of selected attention heads, feedforward networks, or residual streams. However, we reveal that block-level activations are inherently heterogeneous, entangling beneficial, irrelevant, and harmful features, thereby rendering block-level steering coarse, inefficient, and intrusive. To investigate the root cause, we decompose block activations into fine-grained atomic unit (AU)-level activations, where each AU-level activation corresponds to a single dimension of the block activation, and each AU denotes a slice of the block weight matrix. Steering an AU-level activation is thus equivalent to steering its associated AU. Our theoretical and empirical analysis show that heterogeneity arises because different AUs or dimensions control distinct token distributions in LLM outputs. Hence, block-level steering inevitably moves helpful and harmful token directions together, which reduces efficiency. Restricting intervention to beneficial AUs yields more precise and effective steering. Building on this insight, we propose AUSteer, a simple and efficient method that operates at a finer granularity of the AU level. AUSteer first identifies discriminative AUs globally by computing activation momenta on contrastive samples. It then assigns adaptive steering strengths tailored to diverse inputs and selected AU activations. Comprehensive experiments on multiple LLMs and tasks show that AUSteer consistently surpasses advanced baselines while steering considerably fewer activations, demonstrating that steering less achieves more.

Related papers

ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment [49.68063561145927]
We propose a unified ordinary differential equations (ODEs)-based theoretical framework for activation steering.<n>We introduce ODESteer, a kind of ODE-based steering guided by barrier functions.<n>Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements.
arXiv Detail & Related papers (2026-02-19T17:13:44Z)
Mechanistic Indicators of Steering Effectiveness in Large Language Models [3.635648354808971]
Activation-based steering enables Large Language Models to exhibit targeted behaviors by intervening on intermediate activations without retraining.<n>Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood.<n>We investigate whether the reliability of steering can be diagnosed using internal model signals.
arXiv Detail & Related papers (2026-02-02T06:56:22Z)
RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering [62.63376387138257]
We propose a plug-and-play intervention framework that adaptively steers large language models (LLMs) reasoning in activation space.<n>RISER constructs a library of reusable reasoning vectors and employs a lightweight Router to dynamically compose them for each input.<n>The Router is optimized via reinforcement learning under task-level rewards, activating latent cognitive primitives in an emergent and compositional manner.
arXiv Detail & Related papers (2026-01-14T08:04:33Z)
Activation Steering for Masked Diffusion Language Models [1.0980666029958932]
Masked diffusion language models generate text through an iterative denoising process.<n>We present an activation-steering framework for MDLMs that computes layer-wise steering vectors from a single forward pass.<n>Experiments on LLaDA-8B-Instruct demonstrate reliable modulation of high-level attributes.
arXiv Detail & Related papers (2025-12-30T11:10:52Z)
Meaningless Tokens, Meaningful Gains: How Activation Shifts Enhance LLM Reasoning [53.35553353785948]
Motivated by the puzzling observation that inserting long sequences of meaningless tokens before the query prompt can consistently enhance reasoning LLM performance, this work analyzes the underlying mechanism driving this phenomenon.<n>We find that the improvements arise from a redistribution of activations in the LLM's layers, where near zero activations become less frequent while large magnitude activations increase.<n>We propose a lightweight inference-time technique that modifies activations directly without altering the input sequence.
arXiv Detail & Related papers (2025-10-01T15:39:38Z)
Scaling laws for activation steering with Llama 2 models and refusal mechanisms [0.13194391758295113]
CAA works by finding desirable 'directions' in the model's residual stream vector space using contrastive pairs.<n>This paper explores the effectiveness of CAA with model scale using the family of Llama 2 models (7B, 13B, and 70B)
arXiv Detail & Related papers (2025-07-15T22:21:18Z)
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint [49.641959856967276]
We present a theoretically grounded and empirically effective activation steering method called AlphaSteer.<n>For utility preservation, it learns to construct a nearly zero vector for steering benign data, with the null-space constraints.<n>Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer.
arXiv Detail & Related papers (2025-06-08T07:03:28Z)
Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs [8.085475675888045]
Activation steering provides an alternative for inference-time control.<n>We introduce a novel approach using a lightweight, trainable controller network integrated during inference.
arXiv Detail & Related papers (2025-05-22T01:48:38Z)
Sparsing Law: Towards Large Language Models with Greater Activation Sparsity [64.15238674475619]
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated.<n>We propose PPL-$p%$ sparsity, a precise and performance-aware activation sparsity metric.<n>We show that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity.
arXiv Detail & Related papers (2024-11-04T17:59:04Z)
Steering Llama 2 via Contrastive Activation Addition [41.54815073311959]
Contrastive Activation Addition (CAA) is a method for steering language models by modifying their activations during forward passes. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs)
arXiv Detail & Related papers (2023-12-09T04:40:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.