Related papers: HyperSteer: Activation Steering at Scale with Hypernetworks

HyperSteer: Activation Steering at Scale with Hypernetworks

URL: http://arxiv.org/abs/2506.03292v1
Date: Tue, 03 Jun 2025 18:32:01 GMT
Title: HyperSteer: Activation Steering at Scale with Hypernetworks
Authors: Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, Atticus Geiger,
Abstract summary: HyperSteer is a family of hypernetwork-based architectures which are trained end-to-end to generate steering vectors conditioned on the natural language steering prompts.<n>We show that scaling HyperSteer with thousands of steering prompts exceeds the performance of state-of-the-art activation steering methods.
Score: 25.6004576064897
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Steering language models (LMs) by modifying internal activations is a popular approach for controlling text generation. Unsupervised dictionary learning methods, e.g., sparse autoencoders, can be scaled to produce many steering vectors, but lack guarantees on the individual efficacy of each vector and control over the coverage of relevant steering tasks. In contrast, supervised methods for constructing steering vectors are targeted and effective, but require more data collection and training for each additional steering vector produced. In this work, we introduce HyperSteer, a family of hypernetwork-based architectures which are trained end-to-end to generate steering vectors conditioned on the natural language steering prompts and the internals of the steered LM. In our evaluations, we show that scaling HyperSteer with thousands of steering prompts exceeds the performance of state-of-the-art activation steering methods, even on steering prompts never seen during training. Moreover, HyperSteer performs on par with steering-via-prompting.

Related papers

AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint [49.641959856967276]
We present a theoretically grounded and empirically effective activation steering method called AlphaSteer.<n>For utility preservation, it learns to construct a nearly zero vector for steering benign data, with the null-space constraints.<n>Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer.
arXiv Detail & Related papers (2025-06-08T07:03:28Z)
Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms [71.85633762642125]
The vast number of parameters in models often results in highly intertwined internal representations.<n>Recent research has explored the use of sparse autoencoders (SAE) to disentangle knowledge in high-dimensional spaces for steering.<n>We propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety.
arXiv Detail & Related papers (2025-05-23T17:59:18Z)
Interpretable Steering of Large Language Models with Feature Guided Activation Additions [4.496738719682736]
We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method.<n>By operating in the latent space of a Sparse Autoencoder (SAE), FGAA constructs precise steering vectors.<n> evaluations on Gemma-2-2B and Gemma-2-9B models demonstrate that FGAA outperforms existing steering methods.
arXiv Detail & Related papers (2025-01-17T02:55:23Z)
Improving Instruction-Following in Language Models through Activation Steering [58.876600545898675]
We derive instruction-specific vector representations from language models and use them to steer models accordingly.<n>We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion.<n>Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
arXiv Detail & Related papers (2024-10-15T08:38:20Z)
Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering [0.0]
This paper explores activation engineering, where outputs of pre-trained LLMs are controlled by manipulating their activations at inference time.<n>We introduce conceptors - mathematical constructs that represent sets of activation vectors as ellipsoidal regions.<n>Our experiments demonstrate that conceptors outperform traditional methods across multiple steering tasks.
arXiv Detail & Related papers (2024-10-09T10:09:37Z)
Analyzing the Generalization and Reliability of Steering Vectors [8.253773195379166]
We show that steering vectors have substantial limitations both in- and out-of-distribution.<n>In-distribution, steerability is highly variable across different inputs.<n>Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt.
arXiv Detail & Related papers (2024-07-17T08:32:03Z)
AD-H: Autonomous Driving with Hierarchical Agents [64.49185157446297]
We propose to connect high-level instructions and low-level control signals with mid-level language-driven commands. We implement this idea through a hierarchical multi-agent driving system named AD-H.
arXiv Detail & Related papers (2024-06-05T17:25:46Z)
Improving Activation Steering in Language Models with Mean-Centring [10.101141087916133]
We find that taking the average of activations associated with a target dataset, and subtracting the mean of all training activations, results in effective steering vectors. We also apply mean-centring to extract function vectors, more effectively triggering the execution of a range of natural language tasks by a significant margin.
arXiv Detail & Related papers (2023-12-06T18:27:07Z)
TLControl: Trajectory and Language Control for Human Motion Synthesis [68.09806223962323]
We present TLControl, a novel method for realistic human motion synthesis. It incorporates both low-level Trajectory and high-level Language semantics controls. It is practical for interactive and high-quality animation generation.
arXiv Detail & Related papers (2023-11-28T18:54:16Z)
Neural Dynamic Policies for End-to-End Sensorimotor Learning [51.24542903398335]
The current dominant paradigm in sensorimotor control, whether imitation or reinforcement learning, is to train policies directly in raw action spaces. We propose Neural Dynamic Policies (NDPs) that make predictions in trajectory distribution space. NDPs outperform the prior state-of-the-art in terms of either efficiency or performance across several robotic control tasks.
arXiv Detail & Related papers (2020-12-04T18:59:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.