Improving Instruction-Following in Language Models through Activation Steering
- URL: http://arxiv.org/abs/2410.12877v1
- Date: Tue, 15 Oct 2024 08:38:20 GMT
- Title: Improving Instruction-Following in Language Models through Activation Steering
- Authors: Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi
- Abstract summary: We derive instruction-specific vector representations from language models and use them to steer models accordingly.
We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion.
Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
- Score: 58.876600545898675
- Abstract: The ability to follow instructions is crucial for numerous real-world applications of language models. In pursuit of deeper insights and more powerful capabilities, we derive instruction-specific vector representations from language models and use them to steer models accordingly. These vectors are computed as the difference in activations between inputs with and without instructions, enabling a modular approach to activation steering. We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion, providing inference-time control over instruction following. Our experiments across four models show that the activation vectors can guide models to follow constraints even without explicit instructions and can enhance performance when instructions are present. Additionally, we explore the compositionality of activation steering, successfully applying multiple instructions simultaneously. Finally, we show that steering vectors computed on instruction-tuned models transfer to improve base models. Our findings indicate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
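To make the difference-of-activations construction concrete, here is a minimal PyTorch sketch, assuming a HuggingFace GPT-2 backbone. The model name, layer index, steering scale, and single prompt pair are illustrative placeholders (the paper computes vectors from activations with and without instructions, typically aggregated over many inputs); this is not the authors' code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative placeholders: model, layer, and scale are not from the paper.
MODEL, LAYER, ALPHA = "gpt2", 6, 4.0

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def mean_activation(text: str) -> torch.Tensor:
    """Mean residual-stream activation after block LAYER for one prompt."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0].mean(dim=0)  # (hidden_dim,)

# Steering vector = activations with the instruction minus without it.
# In practice this difference would be averaged over many prompt pairs.
steer = (mean_activation("Answer in exactly three words. Describe the ocean.")
         - mean_activation("Describe the ocean."))

def hook(_module, _inputs, output):
    """Add the scaled steering vector to the block's output hidden states."""
    hidden = output[0] + ALPHA * steer
    return (hidden,) + output[1:]

# Steer generation on a prompt that carries no explicit instruction.
handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("Describe a forest.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30)[0]))
handle.remove()
```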
Related papers
- Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering [0.0]
This paper explores activation engineering, where outputs of pre-trained LLMs are controlled by manipulating their activations at inference time.
We introduce conceptors: mathematical constructs that represent sets of activation vectors as ellipsoidal regions (see the sketch after this entry).
Our experiments demonstrate that conceptors outperform traditional methods across multiple in-context learning steering tasks.
arXiv Detail & Related papers (2024-10-09T10:09:37Z)
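As a rough illustration of the ellipsoidal-region idea, the sketch below computes a conceptor matrix with the standard formula C = R (R + α⁻² I)⁻¹ from the conceptor literature, where R is the correlation matrix of the activations and α is the aperture. The aperture value, dimensions, and random placeholder data are assumptions, and applying C as a soft projection is a simplification of the paper's steering setup.

```python
import torch

def conceptor(acts: torch.Tensor, aperture: float = 10.0) -> torch.Tensor:
    """C = R (R + aperture^-2 I)^-1, where R is the correlation matrix of
    the activation set `acts` with shape (n_samples, hidden_dim)."""
    n, d = acts.shape
    R = acts.T @ acts / n                      # (d, d) correlation matrix
    return R @ torch.linalg.inv(R + aperture ** -2 * torch.eye(d))

# Hypothetical usage: cached activations that exhibit a target behaviour.
acts = torch.randn(200, 64)
C = conceptor(acts)                 # ellipsoidal region of that activation set
h = torch.randn(64)                 # a hidden state at inference time
h_steered = C @ h                   # soft projection instead of vector addition
```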
- Controllable Navigation Instruction Generation with Chain of Thought Prompting [74.34604350917273]
We propose C-Instructor, which uses chain-of-thought-style prompts for style- and content-controllable instruction generation.
C-Instructor makes generated instructions easier to follow and offers greater control over the manipulation of landmark objects.
arXiv Detail & Related papers (2024-07-10T07:37:20Z)
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a method that further improves performance by requiring models to generate and compare multiple divergent chains of thought (DCoT) before answering (a hypothetical example of the format follows below).
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
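Below is a purely hypothetical sketch of what a DCoT-style training target could look like: multiple reasoning chains produced in a single pass, compared before a final answer. The schema, delimiters, and wording are assumptions for illustration, not the paper's actual data format.

```python
# Hypothetical DCoT-style training instance: the model is tuned to emit
# several divergent reasoning chains and then a single final answer.
example = {
    "question": ("A train travels 60 km in 40 minutes. "
                 "What is its speed in km/h?"),
    "target": (
        "Chain 1: 40 minutes is 2/3 of an hour, so 60 / (2/3) = 90 km/h.\n"
        "Chain 2: 60 km per 40 min scales to 90 km per 60 min, i.e. 90 km/h.\n"
        "Comparison: both chains agree.\n"
        "Answer: 90 km/h"
    ),
}
print(example["question"], example["target"], sep="\n")
```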
- Transformer-based Causal Language Models Perform Clustering [20.430255724239448]
We introduce a simplified instruction-following task and use synthetic datasets to analyze a Transformer-based causal language model.
Our findings suggest that the model learns task-specific information by clustering data within its hidden space, with this clustering process evolving dynamically during learning.
arXiv Detail & Related papers (2024-02-19T14:02:31Z)
- Improving Activation Steering in Language Models with Mean-Centring [10.101141087916133]
We find that taking the average of activations associated with a target dataset, and subtracting the mean of all training activations, results in effective steering vectors (see the sketch after this entry).
We also apply mean-centring to extract function vectors, which trigger the execution of a range of natural language tasks significantly more effectively.
arXiv Detail & Related papers (2023-12-06T18:27:07Z)
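A minimal sketch of the recipe as stated in the summary: average the activations for the target dataset and subtract the mean over all training activations. The shapes and random placeholder tensors are illustrative assumptions.

```python
import torch

def mean_centred_vector(target_acts: torch.Tensor,
                        all_acts: torch.Tensor) -> torch.Tensor:
    """Mean target activation minus the mean over all training activations."""
    return target_acts.mean(dim=0) - all_acts.mean(dim=0)

# Hypothetical cached residual-stream activations at a single layer.
target_acts = torch.randn(500, 768)     # prompts exhibiting the target concept
all_acts = torch.randn(50_000, 768)     # generic training-distribution prompts
steer = mean_centred_vector(target_acts, all_acts)   # added during inference
```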
- Instruction-following Evaluation through Verbalizer Manipulation [64.73188776428799]
We propose a novel instruction-following evaluation protocol called verbalizer manipulation.
It instructs the model to verbalize the task label with words that align with model priors to different extents (an illustrative example follows below).
We observe that performance on less natural verbalizers sharply differentiates the instruction-following abilities of models across different families and scales.
arXiv Detail & Related papers (2023-07-20T03:54:24Z)
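To illustrate the protocol, the sketch below builds the same classification prompt under verbalizers that align with, are indifferent to, or contradict model priors. The task, label words, and template are hypothetical examples, not the paper's exact materials.

```python
# Three hypothetical verbalizers for binary sentiment classification:
natural = {"pos": "positive", "neg": "negative"}  # aligns with model priors
neutral = {"pos": "foo", "neg": "bar"}            # no prior either way
flipped = {"pos": "negative", "neg": "positive"}  # contradicts model priors

def build_prompt(review: str, verbalizer: dict) -> str:
    """Same task; only the words the model must emit for each label change."""
    return (
        f"Review: {review}\n"
        f"If the sentiment is positive, answer '{verbalizer['pos']}'; "
        f"if it is negative, answer '{verbalizer['neg']}'.\n"
        "Answer:"
    )

# A model that truly follows instructions should also succeed when the
# verbalizer is flipped, even though its priors pull the other way.
print(build_prompt("A delightful, sharply written film.", flipped))
```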
- Few-shot Prompting Towards Controllable Response Generation [49.479958672988566]
We first explore the combination of prompting and reinforcement learning (RL) to steer models' generation without accessing any of the models' parameters.
We then apply multi-task learning to help the model generalize better to new tasks.
Experimental results show that our proposed method can successfully control several state-of-the-art (SOTA) dialogue models without accessing their parameters.
arXiv Detail & Related papers (2022-06-08T14:48:06Z)
- Skill Induction and Planning with Latent Language [94.55783888325165]
We formulate a generative model of action sequences in which goals generate sequences of high-level subtask descriptions.
We describe how to train this model on primarily unannotated demonstrations by parsing them into sequences of named high-level subtasks.
In trained models, the space of natural language commands indexes a library of skills; agents can use these skills to plan by generating high-level instruction sequences tailored to novel goals.
arXiv Detail & Related papers (2021-10-04T15:36:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.