Steer2Edit: From Activation Steering to Component-Level Editing
- URL: http://arxiv.org/abs/2602.09870v1
- Date: Tue, 10 Feb 2026 15:15:15 GMT
- Title: Steer2Edit: From Activation Steering to Component-Level Editing
- Authors: Chung-En Sun, Ge Yan, Zimo Wang, Tsui-Wei Weng
- Abstract summary: We propose Steer2Edit, a training-free framework that transforms steering vectors into diagnostic signals for component-level rank-1 weight editing. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
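The mechanism the abstract describes can be illustrated with a toy sketch: score each component by how strongly its output weights align with a steering direction, then apply a rank-1 update only to the top-scoring components instead of injecting the direction at inference time. The matrix shapes, scoring rule, and `alpha` scale below are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_neurons = 64, 256

# Hypothetical MLP output projection: each row maps one neuron
# into the residual stream.
W_out = rng.normal(0, 0.02, size=(n_neurons, d_model))

# Steering direction in the residual stream (unit norm).
v = rng.normal(size=d_model)
v /= np.linalg.norm(v)

# Diagnose: score each neuron by how strongly its output weights
# align with the steering direction.
scores = W_out @ v                      # shape (n_neurons,)
top = np.argsort(-np.abs(scores))[:8]   # edit only a few components

# Rank-1 edit: nudge only the selected neurons' output weights
# toward v, rather than adding v globally during generation.
alpha = 0.5
W_edited = W_out.copy()
W_edited[top] += alpha * np.outer(scores[top], v)

# The update is rank-1 and touches only the `top` rows;
# all other components are left untouched.
```

Because every modified row moves by a multiple of the same direction `v`, the overall weight delta has rank one, which is what makes the edit cheap and interpretable.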
Related papers
- Weight Updates as Activation Shifts: A Principled Framework for Steering [54.70188910511715]
Activation steering promises to be an extremely parameter-efficient form of adaptation, but its effectiveness depends on critical design choices. We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior. This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically-backed and highly expressive intervention site.
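The equivalence this summary refers to can be checked on a toy linear layer: adding a steering vector v to the layer's output for a given input x is exactly reproduced by the rank-1 weight update W + v xᵀ / (xᵀx). The shapes and the single-input setting are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 32, 32
W = rng.normal(size=(d_out, d_in))
v = rng.normal(size=d_out)          # steering vector added to the output
x = rng.normal(size=d_in)           # a representative input activation

# Activation steering at inference time: y = W x + v
y_steered = W @ x + v

# Equivalent rank-1 weight update for this particular input:
# W' = W + v x^T / (x^T x), so that W' x = W x + v.
W_prime = W + np.outer(v, x) / (x @ x)
y_edited = W_prime @ x

# For this x the two interventions coincide exactly; for other
# inputs they agree only to first order.
```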
arXiv Detail & Related papers (2026-02-28T02:50:04Z)
- AMPS: Adaptive Modality Preference Steering via Functional Entropy [66.69992693275061]
We introduce an instance-aware diagnostic metric that quantifies each modality's information contribution and reveals sample-specific susceptibility to steering. Experimental results show that our instance-aware steering outperforms conventional steering in modulating modality preference.
arXiv Detail & Related papers (2026-02-13T02:29:06Z)
- Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics [81.80010043113445]
Local weight fine-tuning, LoRA-based adaptation, and activation-based interventions are studied in isolation. We present a unified view that frames these interventions as dynamic weight updates induced by a control signal. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility.
arXiv Detail & Related papers (2026-02-02T17:04:36Z) - Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection [1.7802147489386628]
Large language models (LLMs) remain vulnerable to adversarial attacks that elicit harmful behaviors. We propose Selective Steering, which addresses these limitations through two key innovations. Experiments across nine models demonstrate that Selective Steering achieves 5.5x higher attack success rates than prior methods.
arXiv Detail & Related papers (2026-01-27T08:56:25Z) - Activation Steering with a Feedback Controller [4.609594868699996]
Proportional-Integral-Derivative (PID) Steering is a principled framework that leverages the full PID controller for activation steering in large language models. PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control.
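As a rough illustration of the idea (not the paper's actual controller), a discrete PID loop can adjust a steering coefficient so that a measured quantity, e.g. an activation's projection onto the steering direction, tracks a target. The gains and the one-dimensional "plant" below are invented for the demo.

```python
class PIDSteering:
    """Toy PID controller for a steering coefficient (illustrative only)."""

    def __init__(self, kp=0.8, ki=0.1, kd=0.2, target=1.0):
        self.kp, self.ki, self.kd, self.target = kp, ki, kd, target
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, measured):
        # Error: how far the measured projection is from the target.
        error = self.target - measured
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PIDSteering()
measured = 0.0
for _ in range(100):
    coeff = pid.step(measured)
    measured += 0.5 * coeff   # crude stand-in for the model's response
# `measured` settles near the target of 1.0
```

The integral term removes steady-state error while the derivative term damps oscillation; a pure proportional controller, by contrast, would either undershoot or ring.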
arXiv Detail & Related papers (2025-10-05T18:05:28Z) - GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers, guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z) - REAL: Reading Out Transformer Activations for Precise Localization in Language Model Steering [26.428347164111926]
Inference-time steering aims to alter a large language model's responses without changing its parameters. Existing approaches often rely on simplistic cues or ad hoc generalizations. We introduce REAL, a framework for identifying behavior-relevant modules in Transformer models.
arXiv Detail & Related papers (2025-06-10T02:16:50Z) - Fusion Steering: Prompt-Specific Activation Control [0.0]
Fusion Steering improves factual accuracy in large language models (LLMs) for question-answering (QA) tasks. This approach introduces flexible steering configurations, including full-layer steering and segmented steering. Under the stricter SimpleQA rubric, segmented steering boosts fully correct responses from 0.0% to 13.1%.
arXiv Detail & Related papers (2025-05-28T16:46:55Z) - AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender [99.3105257001476]
We propose AdaSteer, an adaptive activation steering method that adjusts model behavior based on input characteristics. AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD). Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
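A minimal sketch of input-adaptive steering along two fixed directions, with the steering strength gated by the input's projection onto a "harmfulness" direction. The directions, gating rule, and coefficients here are invented for illustration and are not AdaSteer's actual method.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 48

# Hypothetical learned directions in activation space (unit norm).
rejection_dir = rng.normal(size=d)
rejection_dir /= np.linalg.norm(rejection_dir)
harm_dir = rng.normal(size=d)
harm_dir /= np.linalg.norm(harm_dir)

def adaptive_steer(h, rd, hd, base=1.0):
    """Steer h along RD and HD with input-dependent strength (toy version)."""
    # Inputs that project strongly onto the harmfulness direction
    # receive a stronger push toward rejection; benign inputs
    # (non-positive projection) are left unchanged.
    harm_score = float(h @ hd)
    strength = base * max(harm_score, 0.0)
    return h + strength * rd + 0.5 * strength * hd

h = rng.normal(size=d)
h_steered = adaptive_steer(h, rejection_dir, harm_dir)
```

Gating the strength on the input's own representation is what makes the intervention adaptive: the same hook leaves benign prompts untouched while pushing harmful ones toward rejection.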
arXiv Detail & Related papers (2025-04-13T07:39:17Z) - Joint Localization and Activation Editing for Low-Resource Fine-Tuning [73.64004083269424]
We propose a joint localization and activation editing (JoLA) method. JoLA learns (1) which heads in the Transformer to edit, (2) whether the intervention should be additive, multiplicative, or both, and (3) the intervention parameters themselves. We demonstrate that JoLA consistently outperforms existing methods.
arXiv Detail & Related papers (2025-02-03T09:13:09Z)
- Control-Aware Prediction Objectives for Autonomous Driving [78.19515972466063]
We present control-aware prediction objectives (CAPOs) to evaluate the downstream effect of predictions on control without requiring the planner to be differentiable.
We propose two types of importance weights that weight the predictive likelihood: one using an attention model between agents, and another based on control variation when exchanging predicted trajectories for ground truth trajectories.
arXiv Detail & Related papers (2022-04-28T07:37:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.