Steering Large Language Models with Feature Guided Activation Additions
- URL: http://arxiv.org/abs/2501.09929v2
- Date: Mon, 20 Jan 2025 02:51:47 GMT
- Title: Steering Large Language Models with Feature Guided Activation Additions
- Authors: Samuel Soo, Wesley Teng, Chandrasekaran Balaganesh,
- Abstract summary: We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method.<n>By operating in the latent space of a Sparse Autoencoder (SAE), FGAA constructs precise steering vectors.<n> evaluations on Gemma-2-2B and Gemma-2-9B models demonstrate that FGAA outperforms existing steering methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective and reliable control over large language model (LLM) behavior is a significant challenge. While activation steering methods, which add steering vectors to a model's hidden states, are a promising approach, existing techniques often lack precision and interpretability in how they influence model outputs. We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method that leverages insights from Contrastive Activation Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS). By operating in the latent space of a Sparse Autoencoder (SAE) and employing optimization techniques to select desired SAE features, FGAA constructs precise steering vectors that provide better steering effects while maintaining coherence of steered model outputs. In this regard, evaluations on Gemma-2-2B and Gemma-2-9B models across various steering tasks demonstrate that FGAA outperforms existing steering methods of CAA, SAE decoder steering, and SAE-TS. Our results also highlight important trade-offs between steering scale and general model capabilities that are consistent across all tested steering methods.
Related papers
- ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment [49.68063561145927]
We propose a unified ordinary differential equations (ODEs)-based theoretical framework for activation steering.<n>We introduce ODESteer, a kind of ODE-based steering guided by barrier functions.<n>Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements.
arXiv Detail & Related papers (2026-02-19T17:13:44Z) - Mechanistic Indicators of Steering Effectiveness in Large Language Models [3.635648354808971]
Activation-based steering enables Large Language Models to exhibit targeted behaviors by intervening on intermediate activations without retraining.<n>Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood.<n>We investigate whether the reliability of steering can be diagnosed using internal model signals.
arXiv Detail & Related papers (2026-02-02T06:56:22Z) - Dynamically Scaled Activation Steering [3.177576903071419]
We introduce Dynamically Scaled Activation Steering (DSAS), a method-agnostic steering framework that decouples when to steer from how to steer.<n>DSAS adaptively modulates the strength of existing steering transformations across layers and inputs, intervening strongly only when undesired behavior is detected.
arXiv Detail & Related papers (2025-12-03T10:50:15Z) - Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders [63.544453925182005]
We train 90 SAEs across three language models and evaluate their interpretability and steering utility.<n>Our analysis reveals only a relatively weak positive association (tau b approx 0.298), indicating that interpretability is an insufficient proxy for steering performance.<n>We propose a novel selection criterion called Delta Token Confidence, which measures how much amplifying a feature changes the next token distribution.
arXiv Detail & Related papers (2025-10-04T04:14:50Z) - Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement [31.282134977964976]
Existing steering methods rely on large-scale datasets to learn clear behavioral information.<n>We introduce Refinement of Steering Vector via Sparse Autoencoder (SAE-RSV) that leverages SAEs to semantically denoise and augment the steering vectors.<n>In our framework, we first remove task-irrelevant features according to their semantics provided by SAEs, and then enrich task-relevant features missing from the small dataset through their semantic similarity to the identified relevant features.
arXiv Detail & Related papers (2025-09-28T10:49:22Z) - GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks.<n>During inference, GrAInS hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale.<n>It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z) - Scaling laws for activation steering with Llama 2 models and refusal mechanisms [0.13194391758295113]
CAA works by finding desirable 'directions' in the model's residual stream vector space using contrastive pairs.<n>This paper explores the effectiveness of CAA with model scale using the family of Llama 2 models (7B, 13B, and 70B)
arXiv Detail & Related papers (2025-07-15T22:21:18Z) - KV Cache Steering for Controlling Frozen LLMs [80.50365534625438]
cache steering is a lightweight method for implicit steering of language models.<n>We apply cache steering to induce chain-of-thought reasoning in small language models.
arXiv Detail & Related papers (2025-07-11T17:59:36Z) - HyperSteer: Activation Steering at Scale with Hypernetworks [25.6004576064897]
HyperSteer is a family of hypernetwork-based architectures which are trained end-to-end to generate steering vectors conditioned on the natural language steering prompts.<n>We show that scaling HyperSteer with thousands of steering prompts exceeds the performance of state-of-the-art activation steering methods.
arXiv Detail & Related papers (2025-06-03T18:32:01Z) - Improved Representation Steering for Language Models [50.86411958644953]
We show how to improve representation steering via our new Reference-free Preference Steering (RePS)<n>On Gemma models with sizes ranging from 2B to 27B, RePS outperforms all existing steering methods trained with a language modeling objective.<n>In suppression, RePS matches the language-modeling objective on Gemma-2 and outperforms it on the larger Gemma-3 variants.
arXiv Detail & Related papers (2025-05-27T07:16:40Z) - Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms [71.85633762642125]
The vast number of parameters in models often results in highly intertwined internal representations.<n>Recent research has explored the use of sparse autoencoders (SAE) to disentangle knowledge in high-dimensional spaces for steering.<n>We propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety.
arXiv Detail & Related papers (2025-05-23T17:59:18Z) - Effectively Steer LLM To Follow Preference via Building Confident Directions [39.40603123075168]
We propose a theoretical framework to understand and quantify the model steering methods.
Inspired by the framework, we propose a confident direction steering method (CONFST) that steers LLMs via modifying their activations.
Our approach offers three key advantages over popular bidirectional model steering methods.
arXiv Detail & Related papers (2025-03-04T20:32:27Z) - Improving Steering Vectors by Targeting Sparse Autoencoder Features [2.4188584949331053]
We develop an improved steering method, SAE-Targeted Steering (SAE-TS), which finds steering vectors to target specific SAE features while minimizing unintended side effects.
We show that SAE-TS balances steering effects with coherence better than CAA and SAE feature steering, when evaluated on a range of tasks.
arXiv Detail & Related papers (2024-11-04T15:46:20Z) - Improving Instruction-Following in Language Models through Activation Steering [58.876600545898675]
We derive instruction-specific vector representations from language models and use them to steer models accordingly.
We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion.
Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
arXiv Detail & Related papers (2024-10-15T08:38:20Z) - Analyzing the Generalization and Reliability of Steering Vectors [8.253773195379166]
We show that steering vectors have substantial limitations both in- and out-of-distribution.
In-distribution, steerability is highly variable across different inputs.
Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt.
arXiv Detail & Related papers (2024-07-17T08:32:03Z) - InferAligner: Inference-Time Alignment for Harmlessness through
Cross-Model Guidance [56.184255657175335]
We develop textbfInferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z) - Steering Llama 2 via Contrastive Activation Addition [41.54815073311959]
Contrastive Activation Addition (CAA) is a method for steering language models by modifying their activations during forward passes.
CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs)
arXiv Detail & Related papers (2023-12-09T04:40:46Z) - Improving Activation Steering in Language Models with Mean-Centring [10.101141087916133]
We find that taking the average of activations associated with a target dataset, and subtracting the mean of all training activations, results in effective steering vectors.
We also apply mean-centring to extract function vectors, more effectively triggering the execution of a range of natural language tasks by a significant margin.
arXiv Detail & Related papers (2023-12-06T18:27:07Z) - Empowering Autonomous Driving with Large Language Models: A Safety Perspective [82.90376711290808]
This paper explores the integration of Large Language Models (LLMs) into Autonomous Driving systems.
LLMs are intelligent decision-makers in behavioral planning, augmented with a safety verifier shield for contextual safety learning.
We present two key studies in a simulated environment: an adaptive LLM-conditioned Model Predictive Control (MPC) and an LLM-enabled interactive behavior planning scheme with a state machine.
arXiv Detail & Related papers (2023-11-28T03:13:09Z) - OSCAR: Data-Driven Operational Space Control for Adaptive and Robust
Robot Manipulation [50.59541802645156]
Operational Space Control (OSC) has been used as an effective task-space controller for manipulation.
We propose OSC for Adaptation and Robustness (OSCAR), a data-driven variant of OSC that compensates for modeling errors.
We evaluate our method on a variety of simulated manipulation problems, and find substantial improvements over an array of controller baselines.
arXiv Detail & Related papers (2021-10-02T01:21:38Z) - A Driving Behavior Recognition Model with Bi-LSTM and Multi-Scale CNN [59.57221522897815]
We propose a neural network model based on trajectories information for driving behavior recognition.
We evaluate the proposed model on the public BLVD dataset, achieving a satisfying performance.
arXiv Detail & Related papers (2021-03-01T06:47:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.