Analyzing the Generalization and Reliability of Steering Vectors
- URL: http://arxiv.org/abs/2407.12404v7
- Date: Sun, 26 Jan 2025 05:43:46 GMT
- Title: Analyzing the Generalization and Reliability of Steering Vectors
- Authors: Daniel Tan, David Chanin, Aengus Lynch, Dimitrios Kanoulas, Brooks Paige, Adria Garriga-Alonso, Robert Kirk
- Abstract summary: We show that steering vectors have substantial limitations both in- and out-of-distribution.
In-distribution, steerability is highly variable across different inputs.
Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt.
- Score: 8.253773195379166
- Abstract: Steering vectors (SVs) have been proposed as an effective approach to adjust language model behaviour at inference time by intervening on intermediate model activations. They have shown promise in terms of improving both capabilities and model alignment. However, the reliability and generalisation properties of this approach are unknown. In this work, we rigorously investigate these properties, and show that steering vectors have substantial limitations both in- and out-of-distribution. In-distribution, steerability is highly variable across different inputs. Depending on the concept, spurious biases can substantially contribute to how effective steering is for each input, presenting a challenge for the widespread use of steering vectors. Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt, resulting in them failing to generalise well. Overall, our findings show that while steering can work well in the right circumstances, there remain technical difficulties of applying steering vectors to guide models' behaviour at scale. Our code is available at https://github.com/dtch1997/steering-bench
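The abstract describes steering as intervening on intermediate model activations at inference time. The sketch below illustrates one common way to do this: take the difference of mean activations between two contrasting sets of inputs, then add a scaled copy of that difference to an activation. This is an assumption for illustration, not necessarily the authors' exact procedure. The function names, the `multiplier` parameter, and the toy numbers are all hypothetical, and plain Python lists stand in for real model activations.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(pos_acts, neg_acts):
    """Difference of means: mean(positive examples) - mean(negative examples)."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(activation, vector, multiplier=1.0):
    """Add the scaled steering vector to a single intermediate activation."""
    return [a + multiplier * v for a, v in zip(activation, vector)]


# Toy 3-dimensional "activations" (made-up numbers, not real model outputs).
pos = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]  # inputs exhibiting the concept
neg = [[0.0, 0.0, 1.0], [0.0, 2.0, 1.0]]  # inputs lacking the concept

sv = steering_vector(pos, neg)
steered = steer([0.5, 0.5, 0.5], sv, multiplier=2.0)
print(sv, steered)
```

In a real model the same addition would be applied to the hidden state of a chosen layer during the forward pass. The paper's findings concern how much the effect of this kind of intervention varies across inputs and prompt changes.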
Related papers
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders [73.37603699731329]
We introduce AxBench, a large-scale benchmark for steering and concept detection.
For steering, we find that prompting outperforms all existing methods, followed by finetuning.
For concept detection, representation-based methods such as difference-in-means perform the best.
arXiv Detail & Related papers (2025-01-28T18:51:24Z) - Steering Large Language Models with Feature Guided Activation Additions [0.0]
We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method.
By operating in the latent space of a Sparse Autoencoder (SAE), FGAA constructs precise steering vectors.
Evaluations on Gemma-2-2B and Gemma-2-9B models demonstrate that FGAA outperforms existing steering methods.
arXiv Detail & Related papers (2025-01-17T02:55:23Z) - Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering [0.0]
This paper explores activation engineering, where outputs of pre-trained LLMs are controlled by manipulating their activations at inference time.
We introduce conceptors - mathematical constructs that represent sets of activation vectors as ellipsoidal regions.
Our experiments demonstrate that conceptors outperform traditional methods across multiple steering tasks.
arXiv Detail & Related papers (2024-10-09T10:09:37Z) - Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment.
We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits.
Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z) - Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [34.05163996072159]
"Steering vectors" are extracted from the activations of human preference data.
This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization.
Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs.
arXiv Detail & Related papers (2024-05-28T05:10:40Z) - Extending Activation Steering to Broad Skills and Multiple Behaviours [5.40770929004319]
We investigate the efficacy of activation steering for broad skills and multiple behaviours.
We find that steering broader skills is competitive with steering narrower skills.
We steer models to become more or less myopic and wealth-seeking.
arXiv Detail & Related papers (2024-03-09T02:30:04Z) - Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features [69.47588461101925]
We propose a method to adapt 3D object detectors to new driving environments.
Our approach enhances LiDAR-based detection models using spatial quantized historical features.
Experiments on real-world datasets demonstrate significant improvements.
arXiv Detail & Related papers (2023-09-21T15:00:31Z) - AI Enhanced Control Engineering Methods [66.08455276899578]
We explore how AI tools can be useful in control applications.
Two immediate applications are linearization of system dynamics for local stability analysis or for state estimation using Kalman filters.
In addition, we explore the use of machine learning models for global parameterizations of state vectors and control inputs in model predictive control applications.
arXiv Detail & Related papers (2023-06-08T20:31:14Z) - Control-Aware Prediction Objectives for Autonomous Driving [78.19515972466063]
We present control-aware prediction objectives (CAPOs) to evaluate the downstream effect of predictions on control without requiring the planner be differentiable.
We propose two types of importance weights that weight the predictive likelihood: one using an attention model between agents, and another based on control variation when exchanging predicted trajectories for ground truth trajectories.
arXiv Detail & Related papers (2022-04-28T07:37:21Z) - Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond [82.37430109152383]
We show that current pedestrian detectors handle even small domain shifts poorly in cross-dataset evaluation.
We attribute the limited generalization to two main factors, the method and the current sources of data.
We propose a progressive fine-tuning strategy which improves generalization.
arXiv Detail & Related papers (2022-01-10T06:00:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.