Related papers: Analyzing the Generalization and Reliability of Steering Vectors

GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
HyperSteer: Activation Steering at Scale with Hypernetworks [25.6004576064897]
HyperSteer is a family of hypernetwork-based architectures which are trained end-to-end to generate steering vectors conditioned on natural language steering prompts. We show that scaling HyperSteer with thousands of steering prompts exceeds the performance of state-of-the-art activation steering methods.
arXiv Detail & Related papers (2025-06-03T18:32:01Z)
Understanding (Un)Reliability of Steering Vectors in Language Models [21.33093425619501]
This paper studies the influence of prompt types and the geometry of activation differences on steering reliability. We find that all seven prompt types used in our experiments produce a net positive steering effect, but exhibit high variance across samples and often give an effect opposite to the desired one.
arXiv Detail & Related papers (2025-05-28T17:53:31Z)
Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms [71.85633762642125]
The vast number of parameters in models often results in highly intertwined internal representations. Recent research has explored the use of sparse autoencoders (SAE) to disentangle knowledge in high-dimensional spaces for steering. We propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety.
arXiv Detail & Related papers (2025-05-23T17:59:18Z)
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders [73.37603699731329]
We introduce AxBench, a large-scale benchmark for steering and concept detection. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best.
arXiv Detail & Related papers (2025-01-28T18:51:24Z)
Interpretable Steering of Large Language Models with Feature Guided Activation Additions [4.496738719682736]
We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method. By operating in the latent space of a Sparse Autoencoder (SAE), FGAA constructs precise steering vectors. Evaluations on Gemma-2-2B and Gemma-2-9B models demonstrate that FGAA outperforms existing steering methods.
arXiv Detail & Related papers (2025-01-17T02:55:23Z)
Improving Steering Vectors by Targeting Sparse Autoencoder Features [2.4188584949331053]
We develop an improved steering method, SAE-Targeted Steering (SAE-TS), which finds steering vectors to target specific SAE features while minimizing unintended side effects. We show that SAE-TS balances steering effects with coherence better than CAA and SAE feature steering, when evaluated on a range of tasks.
arXiv Detail & Related papers (2024-11-04T15:46:20Z)
Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering [0.0]
This paper explores activation engineering, where outputs of pre-trained LLMs are controlled by manipulating their activations at inference time. We introduce conceptors - mathematical constructs that represent sets of activation vectors as ellipsoidal regions. Our experiments demonstrate that conceptors outperform traditional methods across multiple steering tasks.
arXiv Detail & Related papers (2024-10-09T10:09:37Z)
Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits. Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z)
Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [34.05163996072159]
"Steering vectors" are extracted from the activations of human preference data. This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization. Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs.
arXiv Detail & Related papers (2024-05-28T05:10:40Z)
Towards Generalizable and Interpretable Motion Prediction: A Deep Variational Bayes Approach [54.429396802848224]
This paper proposes an interpretable generative model for motion prediction with robust generalizability to out-of-distribution cases. For interpretability, the model achieves the target-driven motion prediction by estimating the spatial distribution of long-term destinations. Experiments on motion prediction datasets validate that the fitted model can be interpretable and generalizable.
arXiv Detail & Related papers (2024-03-10T04:16:04Z)
Extending Activation Steering to Broad Skills and Multiple Behaviours [5.40770929004319]
We investigate the efficacy of activation steering for broad skills and multiple behaviours. We find that steering broader skills is competitive with steering narrower skills. We steer models to become more or less myopic and wealth-seeking.
arXiv Detail & Related papers (2024-03-09T02:30:04Z)
InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment. Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics. It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
Steering Llama 2 via Contrastive Activation Addition [41.54815073311959]
Contrastive Activation Addition (CAA) is a method for steering language models by modifying their activations during forward passes. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
arXiv Detail & Related papers (2023-12-09T04:40:46Z)
Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features [69.47588461101925]
We propose a method to adapt 3D object detectors to new driving environments. Our approach enhances LiDAR-based detection models using spatial quantized historical features. Experiments on real-world datasets demonstrate significant improvements.
arXiv Detail & Related papers (2023-09-21T15:00:31Z)
AI Enhanced Control Engineering Methods [66.08455276899578]
We explore how AI tools can be useful in control applications. Two immediate applications are linearization of system dynamics for local stability analysis or for state estimation using Kalman filters. In addition, we explore the use of machine learning models for global parameterizations of state vectors and control inputs in model predictive control applications.
arXiv Detail & Related papers (2023-06-08T20:31:14Z)
On Learning the Tail Quantiles of Driving Behavior Distributions via Quantile Regression and Flows [13.540998552232006]
We consider the problem of learning models that accurately capture the diversity and tail quantiles of human driver behavior probability distributions. We adapt two flexible quantile learning frameworks for this setting that avoid strong distributional assumptions. We evaluate our approach in a one-step acceleration prediction task, and in multi-step driver simulation rollouts.
arXiv Detail & Related papers (2023-05-22T15:09:04Z)
Control-Aware Prediction Objectives for Autonomous Driving [78.19515972466063]
We present control-aware prediction objectives (CAPOs) to evaluate the downstream effect of predictions on control without requiring the planner to be differentiable. We propose two types of importance weights that weight the predictive likelihood: one using an attention model between agents, and another based on control variation when exchanging predicted trajectories for ground truth trajectories.
arXiv Detail & Related papers (2022-04-28T07:37:21Z)
Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond [82.37430109152383]
We show that current pedestrian detectors handle even small domain shifts poorly in cross-dataset evaluation. We attribute the limited generalization to two main factors: the method and the current sources of data. We propose a progressive fine-tuning strategy which improves generalization.
arXiv Detail & Related papers (2022-01-10T06:00:26Z)
Causally-motivated Shortcut Removal Using Auxiliary Labels [63.686580185674195]
A key challenge in learning such risk-invariant predictors is shortcut learning. We propose a flexible, causally-motivated approach to address this challenge. We show both theoretically and empirically that this causally-motivated regularization scheme yields robust predictors.
arXiv Detail & Related papers (2021-05-13T16:58:45Z)
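
Several entries above describe steering by adding a vector to a model's hidden activations during the forward pass, and the AxBench entry names difference-in-means as a representation-based method. As a rough illustration only, the sketch below assumes the steering vector is the difference between the mean activations of two contrastive example sets and that it is added to a hidden state with a scale factor. The function names (mean_vector, steering_vector, apply_steering), the alpha parameter, and the toy numbers are hypothetical and are not taken from any of the listed papers.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(positive_acts, negative_acts):
    """Difference of mean activations: mean(positive) - mean(negative)."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(hidden, vector, alpha=1.0):
    """Add alpha * vector to one hidden-state vector.

    A negative alpha pushes in the opposite direction.
    """
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 3-dimensional activations (made-up numbers, not from any model).
positive = [[1.0, 2.0, 0.0], [3.0, 2.0, 2.0]]
negative = [[0.0, 1.0, 1.0], [2.0, 1.0, 1.0]]

v = steering_vector(positive, negative)
steered = apply_steering([0.5, 0.5, 0.5], v, alpha=2.0)
```

In a real model the activation vectors would be read from a chosen transformer layer and the addition applied at that layer for each generated token; the choice of layer and scale is left open here.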

This list is automatically generated from the titles and abstracts of the papers on this site.