Related papers: Towards Understanding Steering Strength

URL: http://arxiv.org/abs/2602.02712v1
Date: Mon, 02 Feb 2026 19:25:37 GMT
Title: Towards Understanding Steering Strength
Authors: Magamed Taimeskhanov, Samuel Vaiter, Damien Garreau
Abstract summary: A popular approach to post-training control of large language models is the steering of intermediate latent representations. In this work, we propose the first theoretical analysis of steering strength. Our analysis reveals surprising behaviors, including non-monotonic effects of steering strength.
Score: 15.203729631608253
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A popular approach to post-training control of large language models (LLMs) is the steering of intermediate latent representations. Namely, one identifies a well-chosen direction depending on the task at hand and perturbs representations along this direction at inference time. While many proposals exist for picking this direction, considerably less is understood about how to choose the magnitude of the move, even though its importance is clear: too little and the intended behavior does not emerge, too much and the model's performance degrades beyond repair. In this work, we propose the first theoretical analysis of steering strength. We characterize its effect on next-token probability, presence of a concept, and cross-entropy, deriving precise qualitative laws governing these quantities. Our analysis reveals surprising behaviors, including non-monotonic effects of steering strength. We validate our theoretical predictions empirically on eleven language models, ranging from a small GPT architecture to modern models.
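The perturbation described in the abstract can be illustrated with a toy sketch. This is not the paper's model or experimental setup: the hidden state `h`, steering direction `v`, unembedding matrix `W`, strength `alpha`, and the helper `steered_probs` are all hypothetical names and values chosen for illustration. The sketch adds `alpha * v` to a hidden state, maps the result to logits with a linear layer, and prints the next-token probabilities for several strengths.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steered_probs(h, v, W, alpha):
    """Next-token probabilities after steering: softmax(W @ (h + alpha * v)).

    h, v: hidden state and steering direction (lists of equal length).
    W: toy unembedding matrix, one row per vocabulary token.
    alpha: steering strength.
    """
    steered = [hi + alpha * vi for hi, vi in zip(h, v)]
    logits = [sum(w * s for w, s in zip(row, steered)) for row in W]
    return softmax(logits)


# Hypothetical toy values (assumptions, not taken from the paper).
h = [0.0, 1.0]           # unsteered hidden state
v = [1.0, 0.0]           # steering direction
W = [
    [0.0, 2.0],          # token 0: favoured by the unsteered state
    [1.0, 0.0],          # token 1: mildly aligned with v
    [2.0, -2.0],         # token 2: most aligned with v
]

for alpha in [0.0, 1.0, 2.0, 3.0, 4.0, 6.0]:
    p = steered_probs(h, v, W, alpha)
    print(f"alpha={alpha:>4}: " + "  ".join(f"p{i}={pi:.3f}" for i, pi in enumerate(p)))
```

In this toy setup the logits are `[2, alpha, 2*alpha - 2]`, so the probability of token 1 rises until `alpha = 2` and then decays as token 2 takes over, while the probability of token 2 increases monotonically. Even a linear readout can therefore produce a non-monotonic response to steering strength for some tokens.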

Related papers

Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics [81.80010043113445]
Local weight fine-tuning, LoRA-based adaptation, and activation-based interventions are studied in isolation. We present a unified view that frames these interventions as dynamic weight updates induced by a control signal. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility.
arXiv Detail & Related papers (2026-02-02T17:04:36Z)
From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models [77.04403907729738]
This survey charts the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior. We demonstrate how uncertainty is leveraged as an active control signal across three frontiers. This survey argues that mastering the new trend of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
arXiv Detail & Related papers (2026-01-22T06:21:31Z)
Momentum Point-Perplexity Mechanics in Large Language Models [0.0]
We take a physics-based approach to studying how the internal hidden states of large language models change from token to token during inference. We find that a quantity combining the rate of change in hidden states and the model's next-token certainty, analogous to energy in physics, remains nearly constant. We derive a control method called Jacobian steering, which perturbs hidden states in the minimal way needed to favor a target token.
arXiv Detail & Related papers (2025-08-11T21:50:34Z)
Reasoning-Finetuning Repurposes Latent Representations in Base Models [1.3286418032136589]
Backtracking, an emergent behavior elicited by reasoning fine-tuning, has been shown to be a key mechanism in reasoning models' enhanced capabilities. We show that the emergence of backtracking is in part driven by a repurposed direction already present in base model activations.
arXiv Detail & Related papers (2025-07-16T21:21:03Z)
KV Cache Steering for Controlling Frozen LLMs [80.50365534625438]
Cache steering is a lightweight method for implicit steering of language models. We apply cache steering to induce chain-of-thought reasoning in small language models.
arXiv Detail & Related papers (2025-07-11T17:59:36Z)
Weight Spectra Induced Efficient Model Adaptation [54.8615621415845]
Fine-tuning large-scale foundation models incurs prohibitive computational costs. We show that fine-tuning predominantly amplifies the top singular values while leaving the remainder largely intact. We propose a novel method that leverages learnable rescaling of top singular directions.
arXiv Detail & Related papers (2025-05-29T05:03:29Z)
Mitigating Overthinking in Large Reasoning Models via Manifold Steering [32.666911833023526]
Large Reasoning Models (LRMs) exhibit a phenomenon known as overthinking during inference. We propose Manifold Steering, a novel approach that elegantly projects the steering direction onto the low-dimensional activation manifold. Our method reduces output tokens by up to 71% while maintaining and even improving the accuracy on several mathematical benchmarks.
arXiv Detail & Related papers (2025-05-28T14:39:26Z)
The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think [81.38614558541772]
We introduce the CoT Encyclopedia, a framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs. We show that this framework produces more interpretable and comprehensive analyses than existing methods.
arXiv Detail & Related papers (2025-05-15T11:31:02Z)
Symmetric Pruning of Large Language Models [61.309982086292756]
Popular post-training pruning methods such as Wanda and RIA are known for their simple, yet effective, designs. This paper introduces new theoretical insights that redefine the standard minimization objective for pruning. We propose complementary strategies that consider both input activations and weight significance.
arXiv Detail & Related papers (2025-01-31T09:23:06Z)
A Timeline and Analysis for Representation Plasticity in Large Language Models [0.0]
This paper aims to understand how "honesty" and model plasticity evolve by applying steering extracted at different fine-tuning stages. The findings are pivotal, showing that while early steering exhibits high plasticity, later stages have a surprisingly responsive critical window. These insights greatly contribute to the field of AI transparency, addressing a pressing lack of efficiency limiting our ability to effectively steer model behavior.
arXiv Detail & Related papers (2024-10-08T17:34:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.