Related papers: Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations

URL: http://arxiv.org/abs/2602.17881v1
Date: Thu, 19 Feb 2026 22:37:05 GMT
Title: Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations
Authors: Joschka Braun
Abstract summary: I investigate why steering reliability differs across behaviors and how it is impacted by steering vector training data. I find that higher cosine similarity between training activation differences predicts more reliable steering. I observe that behavior datasets where positive and negative activations are better separated along the steering direction are more reliably steerable.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Steering vectors are a lightweight method for controlling language model behavior by adding a learned bias to the activations at inference time. Although effective on average, steering effect sizes vary across samples and are unreliable for many target behaviors. In my thesis, I investigate why steering reliability differs across behaviors and how it is impacted by steering vector training data. First, I find that higher cosine similarity between training activation differences predicts more reliable steering. Second, I observe that behavior datasets where positive and negative activations are better separated along the steering direction are more reliably steerable. Finally, steering vectors trained on different prompt variations are directionally distinct, yet perform similarly well and exhibit correlated efficacy across datasets. My findings suggest that steering vectors are unreliable when the latent target behavior representation is not effectively approximated by the linear steering direction. Taken together, these insights offer a practical diagnostic for steering unreliability and motivate the development of more robust steering methods that explicitly account for non-linear latent behavior representations.
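The two geometric predictors named in the abstract can be illustrated with a minimal NumPy sketch. It is not the author's implementation. It assumes the steering vector is the mean of paired activation differences (positive minus negative), measures directional agreement as the mean cosine similarity between each training difference and that mean, and measures separation with a d-prime style score on the projections onto the steering direction. The function names, the array layout (one row per training example), and the choice of separation score are all hypothetical.

```python
import numpy as np


def steering_vector(pos, neg):
    """Mean of paired activation differences (assumed mean-difference steering)."""
    return (pos - neg).mean(axis=0)


def mean_cosine_to_steering(pos, neg):
    """Average cosine similarity between each training difference and the steering vector.

    Values near 1 mean the per-example differences point the same way;
    values near 0 mean they are directionally inconsistent.
    """
    diffs = pos - neg
    v_unit = steering_vector(pos, neg)
    v_unit = v_unit / np.linalg.norm(v_unit)
    diff_norms = np.linalg.norm(diffs, axis=1)
    return float(np.mean((diffs @ v_unit) / diff_norms))


def separation_along_direction(pos, neg):
    """d-prime style separation of positive and negative activations
    after projecting them onto the unit steering direction.

    This score is an illustrative choice, not necessarily the paper's metric.
    """
    v_unit = steering_vector(pos, neg)
    v_unit = v_unit / np.linalg.norm(v_unit)
    p = pos @ v_unit
    n = neg @ v_unit
    pooled_std = np.sqrt(0.5 * (p.var() + n.var()))
    return float((p.mean() - n.mean()) / pooled_std)


def make_synthetic(noise_scale, n_examples=200, dim=16, seed=0):
    """Synthetic activations: a shared shift along axis 0 plus per-example noise."""
    rng = np.random.default_rng(seed)
    direction = np.zeros(dim)
    direction[0] = 4.0
    neg = rng.normal(size=(n_examples, dim))
    pos = neg + direction + noise_scale * rng.normal(size=(n_examples, dim))
    return pos, neg


if __name__ == "__main__":
    for label, scale in [("consistent", 0.1), ("noisy", 5.0)]:
        pos, neg = make_synthetic(scale)
        print(label,
              "cosine:", round(mean_cosine_to_steering(pos, neg), 3),
              "separation:", round(separation_along_direction(pos, neg), 3))
```

On synthetic data like this, the low-noise dataset should score higher on both measures than the high-noise one, which mirrors the abstract's claim that datasets with more aligned differences and better separation are more reliably steerable. Real use would replace `make_synthetic` with activations extracted from a model at a chosen layer.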

Related papers

AMPS: Adaptive Modality Preference Steering via Functional Entropy [66.69992693275061]
We introduce an instance-aware diagnostic metric that quantifies each modality's information contribution and reveals sample-specific susceptibility to steering. Experimental results show that our instance-aware steering outperforms conventional steering in modulating modality preference.
arXiv Detail & Related papers (2026-02-13T02:29:06Z)
Steering Latent Traits, Not Learned Facts: An Empirical Study of Activation Control Limits [0.0]
Large language models (LLMs) require precise behavior control for safe and effective deployment across diverse applications. We focus on the question of how steering effectiveness varies across different behavior types and whether the nature of target behaviors can predict steering success.
arXiv Detail & Related papers (2025-11-23T04:28:41Z)
DISCO: Disentangled Communication Steering for Large Language Models [3.4065590965511436]
We propose to inject steering vectors directly into the query and value representation spaces within attention heads. We analytically characterize the effect of our method, which we term DISentangled COmmunication (DISCO) Steering, on attention head outputs.
arXiv Detail & Related papers (2025-09-20T21:56:03Z)
KV Cache Steering for Controlling Frozen LLMs [80.50365534625438]
Cache steering is a lightweight method for implicit steering of language models. We apply cache steering to induce chain-of-thought reasoning in small language models.
arXiv Detail & Related papers (2025-07-11T17:59:36Z)
Understanding (Un)Reliability of Steering Vectors in Language Models [21.33093425619501]
This paper studies the influence of prompt types and the geometry of activation differences on steering reliability. We find that all seven prompt types used in our experiments produce a net positive steering effect, but exhibit high variance across samples, and often give an effect opposite of the desired one.
arXiv Detail & Related papers (2025-05-28T17:53:31Z)
SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.931194824519935]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance. We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z)
Improving Instruction-Following in Language Models through Activation Steering [58.876600545898675]
We derive instruction-specific vector representations from language models and use them to steer models accordingly. We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion. Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
arXiv Detail & Related papers (2024-10-15T08:38:20Z)
Activation Scaling for Steering and Interpreting Language Models [55.59689963561315]
We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings. We establish a three-term objective: a successful intervention should flip the correct with the wrong token and vice versa. Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention.
arXiv Detail & Related papers (2024-10-07T12:01:32Z)
Analyzing the Generalization and Reliability of Steering Vectors [8.253773195379166]
We show that steering vectors have substantial limitations both in- and out-of-distribution. In-distribution, steerability is highly variable across different inputs. Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt.
arXiv Detail & Related papers (2024-07-17T08:32:03Z)
Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [34.05163996072159]
"steering vectors" are extracted from the activations of human preference data. This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization. Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs.
arXiv Detail & Related papers (2024-05-28T05:10:40Z)
Trajectory Forecasting from Detection with Uncertainty-Aware Motion Encoding [121.66374635092097]
Trajectories obtained from object detection and tracking are inevitably noisy. We propose a trajectory predictor directly based on detection results without relying on explicitly formed trajectories.
arXiv Detail & Related papers (2022-02-03T09:09:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.