Related papers: Improving Activation Steering in Language Models with Mean-Centring

Improving Activation Steering in Language Models with Mean-Centring

URL: http://arxiv.org/abs/2312.03813v1
Date: Wed, 6 Dec 2023 18:27:07 GMT
Title: Improving Activation Steering in Language Models with Mean-Centring
Authors: Ole Jorgensen, Dylan Cope, Nandi Schoots, Murray Shanahan
Abstract summary: We find that taking the average of activations associated with a target dataset, and subtracting the mean of all training activations, results in effective steering vectors. We also apply mean-centring to extract function vectors, more effectively triggering the execution of a range of natural language tasks by a significant margin.
Score: 10.101141087916133
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent work in activation steering has demonstrated the potential to better control the outputs of Large Language Models (LLMs), but it involves finding steering vectors. This is difficult because engineers do not typically know how features are represented in these models. We seek to address this issue by applying the idea of mean-centring to steering vectors. We find that taking the average of activations associated with a target dataset, and then subtracting the mean of all training activations, results in effective steering vectors. We test this method on a variety of models on natural language tasks by steering away from generating toxic text, and steering the completion of a story towards a target genre. We also apply mean-centring to extract function vectors, more effectively triggering the execution of a range of natural language tasks by a significant margin (compared to previous baselines). This suggests that mean-centring can be used to easily improve the effectiveness of activation steering in a wide range of contexts.

Related papers

Enhancing Cross-task Transfer of Large Language Models via Activation Steering [75.41750053623298]
Cross-task in-context learning offers a direct solution for transferring knowledge across tasks.<n>We investigate whether cross-task transfer can be achieved via latent space steering without parameter updates or input expansion.<n>We propose a novel Cross-task Activation Steering Transfer framework that enables effective transfer by manipulating the model's internal activation states.
arXiv Detail & Related papers (2025-07-17T15:47:22Z)
The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs [54.59207567677249]
Large language models (LLMs) still struggle across tasks outside of high-resource languages.<n>In this work, we investigate cross-lingual transfer to lower-resource languages where task-specific post-training data is scarce.
arXiv Detail & Related papers (2025-05-23T20:28:31Z)
Improving Instruction-Following in Language Models through Activation Steering [58.876600545898675]
We derive instruction-specific vector representations from language models and use them to steer models accordingly. We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion. Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
arXiv Detail & Related papers (2024-10-15T08:38:20Z)
Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering [0.0]
This paper explores activation engineering, where outputs of pre-trained LLMs are controlled by manipulating their activations at inference time. We introduce conceptors - mathematical constructs that represent sets of activation vectors as ellipsoidal regions. Our experiments demonstrate that conceptors outperform traditional methods across multiple in-context learning steering tasks.
arXiv Detail & Related papers (2024-10-09T10:09:37Z)
Activation Scaling for Steering and Interpreting Language Models [55.59689963561315]
We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings. We establish a three-term objective: a successful intervention should flip the correct with the wrong token and vice versa. Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention.
arXiv Detail & Related papers (2024-10-07T12:01:32Z)
An Energy-based Model for Word-level AutoCompletion in Computer-aided Translation [97.3797716862478]
Word-level AutoCompletion (WLAC) is a rewarding yet challenging task in Computer-aided Translation. Existing work addresses this task through a classification model based on a neural network that maps the hidden vector of the input context into its corresponding label. This work proposes an energy-based model for WLAC, which enables the context hidden vector to capture crucial information from the source sentence.
arXiv Detail & Related papers (2024-07-29T15:07:19Z)
Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [34.05163996072159]
"steering vectors" are extracted from the activations of human preference data. This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization. Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs.
arXiv Detail & Related papers (2024-05-28T05:10:40Z)
Augmented Language Models: a Survey [55.965967655575454]
This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools. We refer to them as Augmented Language Models (ALMs) The missing token objective allows ALMs to learn to reason, use tools, and even act, while still performing standard natural language tasks.
arXiv Detail & Related papers (2023-02-15T18:25:52Z)
Extracting Latent Steering Vectors from Pretrained Language Models [14.77762401765532]
We show that latent vectors can be extracted directly from language model decoders without fine-tuning. Experiments show that there exist steering vectors, which, when added to the hidden states of the language model, generate a target sentence nearly perfectly. We find that distances between steering vectors reflect sentence similarity when evaluated on a textual similarity benchmark.
arXiv Detail & Related papers (2022-05-10T19:04:37Z)
Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings. We demonstrate that this framework enables effective generalization across different environments. For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing. We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.