Improving Steering Vectors by Targeting Sparse Autoencoder Features
- URL: http://arxiv.org/abs/2411.02193v1
- Date: Mon, 04 Nov 2024 15:46:20 GMT
- Title: Improving Steering Vectors by Targeting Sparse Autoencoder Features
- Authors: Sviatoslav Chalnev, Matthew Siu, Arthur Conmy
- Abstract summary: We use SAEs to measure the effects of steering vectors, giving us a method that can be used to understand the causal effect of any steering vector intervention.
We develop an improved steering method, SAE-Targeted Steering (SAE-TS), which finds steering vectors to target specific SAE features while minimizing unintended side effects.
- Score: 2.4188584949331053
- Abstract: To control the behavior of language models, steering methods attempt to ensure that outputs of the model satisfy specific pre-defined properties. Adding steering vectors to the model is a promising method of model control that is easier than finetuning, and may be more robust than prompting. However, it can be difficult to anticipate the effects of steering vectors produced by almost all existing methods, such as CAA (Panickssery et al., 2024) or the direct use of SAE latents (Templeton et al., 2024). In our work, we address this issue by using SAEs to measure the effects of steering vectors, giving us a method that can be used to understand the causal effect of any steering vector intervention. We use this method for measuring causal effects to develop an improved steering method, SAE-Targeted Steering (SAE-TS), which finds steering vectors to target specific SAE features while minimizing unintended side effects. We show that overall, SAE-TS balances steering effects with coherence better than CAA and SAE feature steering, when evaluated on a range of tasks.
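The idea of adding a steering vector and reading off its effect through an SAE can be illustrated with a small, self-contained sketch. The abstract does not specify the measurement procedure, so everything here is an assumption made for illustration: the function names (`sae_encode`, `steer`, `feature_effects`), the ReLU encoder form, and the choice to measure the change in feature activations at the point of intervention rather than on the model's generated outputs. It is a toy, not the authors' SAE-TS implementation.

```python
# Toy sketch (hypothetical names, not the paper's code): add a steering
# vector to activations and measure how SAE feature activations change.

def sae_encode(h, W_enc, b_enc):
    """Assumed SAE encoder: ReLU(W_enc @ h + b_enc), one row of W_enc per feature."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, h)) + b)
        for row, b in zip(W_enc, b_enc)
    ]

def steer(h, v, scale=1.0):
    """Add the steering vector v (times scale) to an activation vector h."""
    return [x + scale * y for x, y in zip(h, v)]

def feature_effects(acts, v, W_enc, b_enc, scale=1.0):
    """Mean change in each SAE feature activation caused by adding v."""
    totals = [0.0] * len(W_enc)
    for h in acts:
        base = sae_encode(h, W_enc, b_enc)
        steered = sae_encode(steer(h, v, scale), W_enc, b_enc)
        for i, (s, b) in enumerate(zip(steered, base)):
            totals[i] += s - b
    return [t / len(acts) for t in totals]

if __name__ == "__main__":
    # Three features on a 3-dimensional activation space (identity encoder).
    W_enc = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    b_enc = [0.0, 0.0, 0.0]
    acts = [[1, 0, 0], [0, 1, 0]]   # two example activations
    v = [0, 0, 2]                   # steering vector aimed at feature 2
    print(feature_effects(acts, v, W_enc, b_enc))
```

In this toy setup the steering vector changes only the feature it is aligned with; a vector with components along other encoder rows would show nonzero entries for those features too, which is the kind of side effect the abstract describes wanting to minimize.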
Related papers
- Analyzing the Generalization and Reliability of Steering Vectors [8.253773195379166]
We show that steering vectors have substantial limitations both in- and out-of-distribution.
In-distribution, steerability is highly variable across different inputs.
Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt.
arXiv Detail & Related papers (2024-07-17T08:32:03Z)
- Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment.
We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits.
Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z)
- Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [34.05163996072159]
"Steering vectors" are extracted from the activations of human preference data.
This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization.
Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs.
arXiv Detail & Related papers (2024-05-28T05:10:40Z)
- Angle Robustness Unmanned Aerial Vehicle Navigation in GNSS-Denied Scenarios [66.05091704671503]
We present a novel angle navigation paradigm to deal with flight deviation in point-to-point navigation tasks.
We also propose a model that includes the Adaptive Feature Enhance Module, Cross-knowledge Attention-guided Module and Robust Task-oriented Head Module.
arXiv Detail & Related papers (2024-02-04T08:41:20Z)
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
- Towards Automated Driving Violation Cause Analysis in Scenario-Based Testing for Autonomous Driving Systems [22.872694649245044]
We propose a novel driving violation cause analysis (DVCA) tool.
Our tool can achieve perfect component-level attribution accuracy (100%) and near-perfect (>98%) message-level accuracy.
arXiv Detail & Related papers (2024-01-19T01:12:37Z)
- Steering Llama 2 via Contrastive Activation Addition [41.54815073311959]
Contrastive Activation Addition (CAA) is a method for steering language models by modifying their activations during forward passes.
CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
arXiv Detail & Related papers (2023-12-09T04:40:46Z)
- Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features [69.47588461101925]
We propose a method to adapt 3D object detectors to new driving environments.
Our approach enhances LiDAR-based detection models using spatial quantized historical features.
Experiments on real-world datasets demonstrate significant improvements.
arXiv Detail & Related papers (2023-09-21T15:00:31Z)
- Tuning Legged Locomotion Controllers via Safe Bayesian Optimization [47.87675010450171]
This paper presents a data-driven strategy to streamline the deployment of model-based controllers in legged robotic hardware platforms.
We leverage a model-free safe learning algorithm to automate the tuning of control gains, addressing the mismatch between the simplified model used in the control formulation and the real system.
arXiv Detail & Related papers (2023-06-12T13:10:14Z)
- Effects of Augmented-Reality-Based Assisting Interfaces on Drivers' Object-wise Situational Awareness in Highly Autonomous Vehicles [13.311257059976692]
We focus on a user interface based on augmented reality (AR), which can highlight potential hazards on the road.
Our study results show that the effects of highlighting on drivers' SA varied by traffic densities, object locations and object types.
arXiv Detail & Related papers (2022-06-06T03:23:34Z)
- Control-Aware Prediction Objectives for Autonomous Driving [78.19515972466063]
We present control-aware prediction objectives (CAPOs) to evaluate the downstream effect of predictions on control without requiring the planner be differentiable.
We propose two types of importance weights that weight the predictive likelihood: one using an attention model between agents, and another based on control variation when exchanging predicted trajectories for ground truth trajectories.
arXiv Detail & Related papers (2022-04-28T07:37:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.