Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
- URL: http://arxiv.org/abs/2602.02343v2
- Date: Wed, 04 Feb 2026 12:45:27 GMT
- Title: Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
- Authors: Ziwen Xu, Chenyan Wu, Hengyu Sun, Haiwen Hong, Mengru Wang, Yunzhi Yao, Longtao Huang, Hui Xue, Shumin Deng, Zhixuan Chu, Huajun Chen, Ningyu Zhang
- Abstract summary: Local weight fine-tuning, LoRA-based adaptation, and activation-based interventions are studied in isolation. We present a unified view that frames these interventions as dynamic weight updates induced by a control signal. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility.
- Score: 81.80010043113445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce SPLIT, a new steering approach guided by this analysis, which improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.
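To make the log-odds measurement concrete, here is a minimal sketch of scoring a polarity-paired contrastive example with a generic Hugging Face causal LM. The model choice, prompts, and the preference definition below are illustrative assumptions, not the paper's SPLIT implementation; utility would be scored analogously on pairs contrasting coherent, task-valid continuations with degenerate ones.

```python
# Hedged sketch: scoring a polarity-paired contrastive example on a log-odds scale.
# Model, prompts, and continuations are illustrative stand-ins, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` after `prompt`.
    Assumes tokenizing prompt and prompt+continuation splits cleanly at the boundary."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Next-token log-probs are shifted by one relative to the token they predict.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    cont_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    cont_tokens = full_ids[0, prompt_ids.shape[1]:]
    return sum(log_probs[p, t].item() for p, t in zip(cont_positions, cont_tokens))

# Polarity-paired continuations: one aligned with the target concept, one opposed.
prompt = "Overall, I thought the movie was"
lp_pos = continuation_logprob(prompt, " wonderful")
lp_neg = continuation_logprob(prompt, " terrible")
preference_log_odds = lp_pos - lp_neg  # >0 means the model leans toward the target polarity
print(f"preference (log-odds): {preference_log_odds:.3f}")
```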
Related papers
- Weight Updates as Activation Shifts: A Principled Framework for Steering [54.70188910511715]
Activation steering promises to be an extremely parameter-efficient form of adaptation, but its effectiveness depends on critical design choices. We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior. This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically backed and highly expressive intervention site (see the sketch below).
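As a quick illustration of the claimed activation/weight correspondence, the sketch below checks numerically that adding a steering vector to a linear layer's output for a fixed input is reproduced by a rank-one weight update. This is a standard identity offered as an assumption about how such an equivalence can be instantiated, not the paper's actual derivation.

```python
# Hedged numerical check: for a fixed input x, adding a steering vector v to the output
# W @ x equals applying the rank-one weight update dW = v x^T / ||x||^2 to that input.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 6
W = rng.normal(size=(d_out, d_in))   # frozen layer weight
x = rng.normal(size=d_in)            # a particular activation entering the layer
v = rng.normal(size=d_out)           # steering vector added to the layer output

steered_activation = W @ x + v                # activation-space intervention
dW = np.outer(v, x) / np.dot(x, x)            # equivalent rank-one weight update
steered_by_weights = (W + dW) @ x             # weight-space intervention

print(np.allclose(steered_activation, steered_by_weights))  # True: identical for this x
```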
arXiv Detail & Related papers (2026-02-28T02:50:04Z)
- Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions [37.08071497197165]
Intervention-based model steering offers a lightweight and interpretable alternative to prompting and fine-tuning. We build on the principles of distributed alignment search to propose a new steering method: Concept DAS (see the sketch below). We show that Concept DAS does not always outperform preference-optimization methods but may benefit more from increased model scale.
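Concept DAS is not described in detail here, but distributed alignment search typically intervenes by swapping the component of a hidden state that lies in a learned low-rank subspace. The sketch below shows that interchange operation with a random subspace standing in for the trained one; treat it as an assumption about the general DAS mechanism rather than this paper's method.

```python
# Hedged sketch of a distributed interchange intervention: the component of the base run's
# hidden state inside a low-rank subspace is replaced by the counterfactual ("source") run's
# component. The subspace here is random; in DAS it is learned.
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 2                                          # hidden size, intervention subspace rank
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))      # orthonormal basis of the subspace (d x k)

h_base = rng.normal(size=d)     # hidden state on the input being steered
h_source = rng.normal(size=d)   # hidden state from a run expressing the target concept

# Replace the span(basis) component of h_base with the corresponding component of h_source.
h_intervened = h_base + basis @ (basis.T @ (h_source - h_base))
print(h_intervened.shape)  # (16,)
```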
arXiv Detail & Related papers (2026-02-05T02:51:00Z)
- Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation [60.33386541343322]
We propose a Multimodal Large Language Model framework that integrates Hardness-aware and Noise-regularized preference optimization for Recommendation (HaNoRec). Specifically, HaNoRec dynamically adjusts optimization weights based on both the estimated hardness of each training sample and the policy model's real-time responsiveness.
arXiv Detail & Related papers (2025-11-24T04:10:46Z)
- ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention [86.93601565563954]
ScaleWeaver is a framework designed to achieve high-fidelity, controllable generation upon advanced visual autoregressive (VAR) models. The proposed Reference Attention module discards the unnecessary attention from image→condition, reducing computational cost. Experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods.
arXiv Detail & Related papers (2025-10-16T17:00:59Z)
- The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features [1.7832672957068079]
We introduce Feature Steering with Reinforcement Learning (FSRL), a framework that trains a lightweight adapter to steer model behavior by modulating interpretable sparse features (see the sketch below). We show that this mechanism is principled and expressive enough to approximate the behavioral shifts of post-training processes. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.
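A rough sketch of what steering via sparse features can look like: a hidden state is mapped into sparse-autoencoder features, a small adapter proposes per-feature offsets, and the decoded change is written back into the hidden state. All weights below are random placeholders, and the reinforcement-learning training loop that FSRL actually uses is omitted, so this is only an assumed shape of the mechanism.

```python
# Hedged sketch: modulate interpretable sparse features with a lightweight adapter.
# SAE weights are random stand-ins; FSRL trains the adapter with RL (not shown).
import torch
import torch.nn as nn

d_model, d_sae = 64, 256
torch.manual_seed(0)

W_enc = torch.randn(d_sae, d_model) / d_model**0.5   # stand-in SAE encoder
W_dec = torch.randn(d_model, d_sae) / d_sae**0.5     # stand-in SAE decoder
adapter = nn.Linear(d_sae, d_sae)                    # lightweight steering adapter

h = torch.randn(d_model)                             # hidden state at the steered layer
features = torch.relu(W_enc @ h)                     # sparse, nominally interpretable features
feature_offsets = adapter(features)                  # adapter decides how to modulate each feature
h_steered = h + W_dec @ feature_offsets              # write the modulation back into the residual stream
print(h_steered.shape)  # torch.Size([64])
```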
arXiv Detail & Related papers (2025-09-16T10:32:40Z)
- GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale (see the sketch below). It consistently outperforms both fine-tuning and existing steering baselines.
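The sketch below illustrates attribution-guided steering in the spirit of this summary: each token's hidden state is shifted along a steering direction in proportion to a token-level attribution score and then rescaled to its original norm. The attribution scores and the direction are placeholders; GrAInS derives them from gradient-based attribution, which is not shown here.

```python
# Hedged sketch: attribution-weighted activation shift with norm preservation.
import torch

torch.manual_seed(0)
seq_len, d_model = 5, 32
hidden = torch.randn(seq_len, d_model)           # hidden states at one transformer layer
direction = torch.randn(d_model)
direction = direction / direction.norm()          # unit-norm steering direction (placeholder)
attribution = torch.rand(seq_len)                 # per-token attribution scores in [0, 1] (placeholder)

alpha = 4.0                                       # steering strength
original_norms = hidden.norm(dim=-1, keepdim=True)
shifted = hidden + alpha * attribution[:, None] * direction             # token-weighted shift
steered = shifted / shifted.norm(dim=-1, keepdim=True) * original_norms  # restore each token's norm
print(steered.shape)  # torch.Size([5, 32])
```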
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
- KV Cache Steering for Controlling Frozen LLMs [80.50365534625438]
Cache steering is a lightweight method for implicit steering of language models (see the sketch below). We apply cache steering to induce chain-of-thought reasoning in small language models.
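A minimal sketch of the cache-steering idea, assuming the common [batch, num_heads, seq_len, head_dim] cache layout: an offset is applied once to a layer's cached value vectors so that subsequent generation by the frozen model attends to shifted context. The offset here is random, whereas the actual method would construct it from examples of the target behavior.

```python
# Hedged sketch: shift the cached value vectors of one layer before continuing generation.
import torch

torch.manual_seed(0)
batch, num_heads, seq_len, head_dim = 1, 8, 12, 64
keys = torch.randn(batch, num_heads, seq_len, head_dim)    # cached keys for one layer
values = torch.randn(batch, num_heads, seq_len, head_dim)  # cached values for one layer

v_offset = torch.randn(num_heads, head_dim)                # steering offset per head (placeholder)
beta = 0.5                                                 # steering strength

# Apply the offset to every cached position's value vectors; keys could be shifted analogously.
steered_values = values + beta * v_offset[None, :, None, :]
print(steered_values.shape)  # torch.Size([1, 8, 12, 64])
```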
arXiv Detail & Related papers (2025-07-11T17:59:36Z)
- Learning Distribution-Wise Control in Representation Space for Language Models [7.756342860929851]
Learnable interventions aim to apply pointwise control within the concept subspace and have proven effective in altering high-level behaviors. We extend this approach to the distribution level, enabling the model to learn not only pointwise transformations but also the surrounding regions of the concept subspace.
arXiv Detail & Related papers (2025-06-07T06:52:58Z)
- Improved Representation Steering for Language Models [50.86411958644953]
We show how to improve representation steering via our new Reference-free Preference Steering (RePS). On Gemma models with sizes ranging from 2B to 27B, RePS outperforms all existing steering methods trained with a language modeling objective. In suppression, RePS matches the language-modeling objective on Gemma-2 and outperforms it on the larger Gemma-3 variants.
arXiv Detail & Related papers (2025-05-27T07:16:40Z)
- Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors [13.630818884973127]
We propose Preference Vector, a novel framework inspired by task arithmetic. Instead of optimizing multiple preferences within a single objective, we train separate models on individual preferences, extract behavior shifts as preference vectors, and dynamically merge them at test time (see the sketch below). Experiments show that our proposed Preference Vector framework improves helpfulness without excessive conservatism, allows smooth control over preference trade-offs, and supports scalable multi-preference alignment.
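The merging step described above maps directly onto task arithmetic over model weights, sketched below with toy state dicts: preference vectors are the differences between tuned and base parameters, and they are combined at test time with adjustable coefficients. The tensors and coefficients are illustrative only.

```python
# Hedged sketch of task-arithmetic-style preference merging with placeholder state dicts.
import torch

torch.manual_seed(0)
base = {"w": torch.randn(4, 4)}                         # base model parameters
helpful = {"w": base["w"] + 0.1 * torch.randn(4, 4)}    # model tuned for helpfulness
harmless = {"w": base["w"] + 0.1 * torch.randn(4, 4)}   # model tuned for harmlessness

# Extract behavior shifts as preference vectors.
v_helpful = {k: helpful[k] - base[k] for k in base}
v_harmless = {k: harmless[k] - base[k] for k in base}

# Merge at test time; the coefficients trade the preferences off against each other.
coef_helpful, coef_harmless = 0.8, 0.5
merged = {k: base[k] + coef_helpful * v_helpful[k] + coef_harmless * v_harmless[k] for k in base}
print(merged["w"].shape)  # torch.Size([4, 4])
```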
arXiv Detail & Related papers (2025-04-27T12:16:51Z)
- Steering Llama 2 via Contrastive Activation Addition [41.54815073311959]
Contrastive Activation Addition (CAA) is a method for steering language models by modifying their activations during forward passes.
CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs); see the sketch after this entry.
arXiv Detail & Related papers (2023-12-09T04:40:46Z)
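For reference, a minimal sketch of the CAA mechanism mentioned above: the steering vector is the mean difference between activations collected on contrastive prompt pairs at one layer, and a scaled copy is added to that layer's activations during the forward pass. The activation tensors below are random placeholders standing in for activations captured from a real model (e.g. via forward hooks), and the multiplier is an arbitrary choice.

```python
# Hedged sketch of Contrastive Activation Addition with placeholder activations.
import torch

torch.manual_seed(0)
n_pairs, d_model = 100, 4096
acts_pos = torch.randn(n_pairs, d_model)   # layer activations on prompts exhibiting the behavior
acts_neg = torch.randn(n_pairs, d_model)   # activations on matched prompts without the behavior

caa_vector = (acts_pos - acts_neg).mean(dim=0)   # contrastive steering direction

def steer(hidden_states: torch.Tensor, multiplier: float = 2.0) -> torch.Tensor:
    """Add the CAA vector to every token's hidden state at the chosen layer."""
    return hidden_states + multiplier * caa_vector

hidden = torch.randn(1, 16, d_model)       # [batch, seq, d_model] activations during a forward pass
print(steer(hidden).shape)                 # torch.Size([1, 16, 4096])
```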