Understanding (Un)Reliability of Steering Vectors in Language Models
- URL: http://arxiv.org/abs/2505.22637v1
- Date: Wed, 28 May 2025 17:53:31 GMT
- Title: Understanding (Un)Reliability of Steering Vectors in Language Models
- Authors: Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, Dmitrii Krasheninnikov
- Abstract summary: This paper studies the influence of prompt types and the geometry of activation differences on steering reliability. We find that all seven prompt types used in our experiments produce a net positive steering effect, but exhibit high variance across samples, and often give an effect opposite of the desired one.
- Score: 21.33093425619501
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Steering vectors are a lightweight method to control language model behavior by adding a learned bias to the activations at inference time. Although steering demonstrates promising performance, recent work shows that it can be unreliable or even counterproductive in some cases. This paper studies the influence of prompt types and the geometry of activation differences on steering reliability. First, we find that all seven prompt types used in our experiments produce a net positive steering effect, but exhibit high variance across samples, and often give an effect opposite of the desired one. No prompt type clearly outperforms the others, and yet the steering vectors resulting from the different prompt types often differ directionally (as measured by cosine similarity). Second, we show that higher cosine similarity between training set activation differences predicts more effective steering. Finally, we observe that datasets where positive and negative activations are better separated are more steerable. Our results suggest that vector steering is unreliable when the target behavior is not represented by a coherent direction.
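As a rough illustration of the setup the abstract describes (a sketch, not the authors' code), the snippet below computes a difference-of-means steering vector from paired positive/negative activations, injects it into a PyTorch module via a forward hook, and measures the mean pairwise cosine similarity of per-sample activation differences, the geometric quantity the paper links to steering reliability. Function names, tensor shapes, and the hook-based injection are assumptions.

```python
# Hedged sketch of difference-of-means steering, not the paper's implementation.
# Assumes residual-stream activations have already been collected at one layer
# (e.g. at the final prompt token) as tensors of shape [num_samples, hidden_dim].
import torch

def compute_steering_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Steering vector = mean activation on positive prompts minus mean on negative prompts."""
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def add_steering_hook(layer: torch.nn.Module, vector: torch.Tensor, scale: float = 1.0):
    """Register a forward hook that adds `scale * vector` to the layer's output."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)  # call .remove() on the handle to undo

def mean_pairwise_cosine(diffs: torch.Tensor) -> torch.Tensor:
    """Mean off-diagonal cosine similarity of per-sample activation differences
    (higher values are associated with more effective steering in the paper)."""
    normed = torch.nn.functional.normalize(diffs, dim=-1)
    sims = normed @ normed.T
    n = sims.shape[0]
    return sims[~torch.eye(n, dtype=torch.bool)].mean()
```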
Related papers
- Improving Instruction-Following in Language Models through Activation Steering [58.876600545898675]
We derive instruction-specific vector representations from language models and use them to steer models accordingly. We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion. Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
arXiv Detail & Related papers (2024-10-15T08:38:20Z) - Activation Scaling for Steering and Interpreting Language Models [55.59689963561315]
We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings.
We establish a three-term objective: a successful intervention should flip the correct token with the wrong one and vice versa.
Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention.
arXiv Detail & Related papers (2024-10-07T12:01:32Z) - Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling [51.38330727868982]
We show how action chunking impacts the divergence between a learner and a demonstrator. We propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop adaptation. Our method boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
arXiv Detail & Related papers (2024-08-30T15:39:34Z) - Analyzing the Generalization and Reliability of Steering Vectors [8.253773195379166]
We show that steering vectors have substantial limitations both in- and out-of-distribution. In-distribution, steerability is highly variable across different inputs. Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt.
arXiv Detail & Related papers (2024-07-17T08:32:03Z) - Extending Activation Steering to Broad Skills and Multiple Behaviours [5.40770929004319]
We investigate the efficacy of activation steering for broad skills and multiple behaviours.
We find that steering broader skills is competitive with steering narrower skills.
We steer models to become more or less myopic and wealth-seeking.
arXiv Detail & Related papers (2024-03-09T02:30:04Z) - Improving Activation Steering in Language Models with Mean-Centring [10.101141087916133]
We find that taking the average of activations associated with a target dataset, and subtracting the mean of all training activations, results in effective steering vectors.
We also apply mean-centring to extract function vectors, triggering the execution of a range of natural language tasks significantly more effectively.
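A minimal sketch of the mean-centring recipe just described (illustrative, not the paper's code): the steering vector is the mean activation over the target dataset minus the mean over all training activations. Tensor names and shapes are assumptions.

```python
import torch

def mean_centred_steering_vector(target_acts: torch.Tensor,
                                 all_train_acts: torch.Tensor) -> torch.Tensor:
    """Mean-centring sketch: target-dataset mean activation minus the mean over
    all training activations; both tensors assumed shape [n, hidden_dim]."""
    return target_acts.mean(dim=0) - all_train_acts.mean(dim=0)
```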
arXiv Detail & Related papers (2023-12-06T18:27:07Z) - Margin-based sampling in high dimensions: When being active is less efficient than staying passive [76.71565772067113]
Recent empirical evidence suggests that margin-based active learning can sometimes perform even worse than passive learning.
We prove for logistic regression that passive learning (PL) outperforms margin-based active learning (AL) even for noiseless data.
Insights from our proof indicate that this high-dimensional phenomenon is exacerbated when the separation between the classes is small.
arXiv Detail & Related papers (2022-12-01T18:55:59Z) - Extracting Latent Steering Vectors from Pretrained Language Models [14.77762401765532]
We show that latent vectors can be extracted directly from language model decoders without fine-tuning.
Experiments show that there exist steering vectors, which, when added to the hidden states of the language model, generate a target sentence nearly perfectly.
We find that distances between steering vectors reflect sentence similarity when evaluated on a textual similarity benchmark.
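The sketch below illustrates, under assumptions rather than as the authors' code, how such a latent steering vector might be recovered without fine-tuning: a vector added to one layer's hidden states is optimised by gradient descent to maximise the likelihood of a target sentence under a HuggingFace-style causal language model. The hook-based injection, optimiser settings, and the assumption that the model and vector share a device are all illustrative.

```python
import torch

def extract_latent_steering_vector(model, tokenizer, target_sentence: str,
                                   layer: torch.nn.Module, hidden_dim: int,
                                   steps: int = 200, lr: float = 1e-2) -> torch.Tensor:
    """Optimise a vector z so that adding it to `layer`'s hidden states makes the
    (frozen) model reproduce `target_sentence` under teacher forcing."""
    ids = tokenizer(target_sentence, return_tensors="pt").input_ids
    z = torch.zeros(hidden_dim, requires_grad=True)  # assumes model is on the same device
    opt = torch.optim.Adam([z], lr=lr)

    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + z
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    handle = layer.register_forward_hook(hook)
    try:
        for _ in range(steps):
            opt.zero_grad()
            out = model(input_ids=ids, labels=ids)  # standard causal-LM loss
            out.loss.backward()
            opt.step()
    finally:
        handle.remove()
    return z.detach()
```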
arXiv Detail & Related papers (2022-05-10T19:04:37Z) - Why Do Self-Supervised Models Transfer? Investigating the Impact of Invariance on Downstream Tasks [79.13089902898848]
Self-supervised learning is a powerful paradigm for representation learning on unlabelled images.
We show that different tasks in computer vision require features to encode different (in)variances.
arXiv Detail & Related papers (2021-11-22T18:16:35Z) - Differentiable Subset Pruning of Transformer Heads [71.7904179689271]
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
arXiv Detail & Related papers (2021-08-10T13:08:34Z) - Multi-type Disentanglement without Adversarial Training [48.51678740102892]
Controlling the style of natural language by disentangling the latent space is an important step towards interpretable machine learning.
We propose a unified distribution-controlling method, which provides each specific style value with a unique representation.
We also propose multiple loss functions to achieve a style-content disentanglement as well as a disentanglement among multiple style types.
arXiv Detail & Related papers (2020-12-16T11:47:18Z)