Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors
- URL: http://arxiv.org/abs/2410.12299v1
- Date: Wed, 16 Oct 2024 06:58:49 GMT
- Title: Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors
- Authors: Weixuan Wang, Jingyuan Yang, Wei Peng
- Abstract summary: Activation intervention has emerged as an effective and economical method to modify the behavior of large language models (LLMs).
We propose Semantics-Adaptive Dynamic Intervention (SADI), a novel method that constructs a dynamic steering vector to intervene model activations at inference time.
Experimental results show that SADI outperforms established baselines by substantial margins, improving task performance without training.
- Score: 8.761404991620285
- Abstract: Large language models (LLMs) have achieved remarkable performance across many tasks, yet aligning them with desired behaviors remains challenging. Activation intervention has emerged as an effective and economical method to modify the behavior of LLMs. Despite considerable interest in this area, current intervention methods exclusively employ a fixed steering vector to modify model activations, lacking adaptability to diverse input semantics. To address this limitation, we propose Semantics-Adaptive Dynamic Intervention (SADI), a novel method that constructs a dynamic steering vector to intervene model activations at inference time. More specifically, SADI utilizes activation differences in contrastive pairs to precisely identify critical elements of an LLM (i.e., attention heads, hidden states, and neurons) for targeted intervention. During inference, SADI dynamically steers model behavior by scaling element-wise activations based on the directions of input semantics. Experimental results show that SADI outperforms established baselines by substantial margins, improving task performance without training. SADI's cost-effectiveness and generalizability across various LLM backbones and tasks highlight its potential as a versatile alignment technique. In addition, we release the code to foster research along this line: https://github.com/weixuan-wang123/SADI.
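The mechanism described in the abstract can be made concrete with a short sketch. The following is an illustrative reconstruction from the abstract alone, not the released code (the array shapes, top-k selection, and scaling rule are assumptions); see the linked repository for the authors' implementation.

```python
import numpy as np

def identify_critical_elements(pos_acts, neg_acts, k=64):
    """Identification step: find the k activation elements (e.g., neurons)
    whose mean activations differ most across contrastive pairs."""
    diff = (pos_acts - neg_acts).mean(axis=0)   # (hidden_dim,) contrastive direction
    mask = np.zeros_like(diff)
    mask[np.argsort(np.abs(diff))[-k:]] = 1.0   # keep only the top-k elements
    return mask, diff

def sadi_intervene(hidden, mask, diff, delta=1.0):
    """Steering step: instead of adding a fixed vector, scale the identified
    elements in proportion to the current input's own activation magnitudes,
    signed by the contrastive direction, yielding a dynamic steering vector."""
    steering = mask * np.sign(diff) * np.abs(hidden)
    return hidden + delta * steering
```

Note that the steering term depends on `hidden`, so two inputs with different semantics receive different interventions; this is the adaptivity that fixed-vector baselines lack.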
Related papers
- Multi-Attribute Steering of Language Models via Targeted Intervention [56.93583799109029]
Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction.
We introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes.
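As a rough illustration of selective token-level, multi-attribute steering (the gating functions, threshold, and attribute vectors below are hypothetical, not the paper's construction):

```python
import numpy as np

def mat_steer_token(h, attr_vectors, attr_gates, tau=0.5):
    """Apply each attribute's steering vector only where its gate judges
    this token relevant, so attributes intervene selectively."""
    out = h.copy()
    for v, gate in zip(attr_vectors, attr_gates):
        g = gate(h)                # relevance score in [0, 1] for this token
        if g > tau:
            out = out + g * v      # attribute-specific, gated intervention
    return out
```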
arXiv Detail & Related papers (2025-02-18T02:27:23Z)
- Task-driven Layerwise Additive Activation Intervention [12.152228552335798]
Modern language models (LMs) have significantly advanced generative modeling in natural language processing (NLP).
This paper proposes a layer-wise additive activation intervention framework that optimizes the intervention process.
We benchmark our framework on various datasets, demonstrating improvements in the accuracy of pre-trained LMs and competing intervention baselines.
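A layer-wise additive intervention reduces to adding a per-layer offset to each hidden state; a minimal sketch follows (how the offsets are optimized is task-driven in the paper and omitted here):

```python
def layerwise_additive_intervention(layer_states, deltas):
    """Add a learned offset delta_l to the hidden state of each layer l.
    layer_states and deltas are lists of equal length, one array per layer."""
    return [h + d for h, d in zip(layer_states, deltas)]
```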
arXiv Detail & Related papers (2025-02-10T02:49:46Z)
- LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models [16.37602070339033]
Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs.
We propose LF-Steering, a novel activation steering approach to precisely identify latent feature representations responsible for semantic inconsistency.
Our method maps the hidden states of the relevant transformer layer into a sparsely activated, high-dimensional feature space based on a sparse autoencoder.
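A toy version of that pipeline: encode the hidden state with a sparse autoencoder, edit the identified features, and decode back (the SAE weights and the feature edit are placeholders, not the paper's trained model):

```python
import numpy as np

class ToySAE:
    """Minimal sparse autoencoder: ReLU encoder into a wider feature space."""
    def __init__(self, hidden_dim, feature_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(size=(hidden_dim, feature_dim)) / np.sqrt(hidden_dim)
        self.W_dec = self.W_enc.T.copy()      # tied weights for simplicity

    def encode(self, h):
        return np.maximum(h @ self.W_enc, 0.0)   # sparse, non-negative features

    def decode(self, f):
        return f @ self.W_dec

def lf_steer(h, sae, inconsistent_features, boost=2.0):
    """Amplify the latent features identified as driving semantic
    inconsistency, then map the edited features back to the hidden state."""
    f = sae.encode(h)
    f[inconsistent_features] *= boost
    return sae.decode(f)
```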
arXiv Detail & Related papers (2025-01-19T13:06:51Z)
- Transformer-Squared: Self-adaptive LLMs [29.1326358746118]
We introduce Transformer-Squared, a novel self-adaptation framework that adapts large language models for unseen tasks in real-time.
Our method consistently outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency.
Transformer-Squared represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs.
arXiv Detail & Related papers (2025-01-09T01:19:21Z)
- First Activations Matter: Training-Free Methods for Dynamic Activation in Large Language Models [25.15698344467722]
This paper introduces a training-free Threshold-based Dynamic Activation method that leverages sequence information to exploit the inherent sparsity of models across various architectures.
We theoretically analyze two of its critical features: history-related activation uncertainty and semantic-irrelevant activation inertia.
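The core operation is a training-free magnitude threshold; one way to use sequence information, in the spirit of the title, is to set the threshold from the earliest tokens (both choices below are assumptions for illustration):

```python
import numpy as np

def sequence_threshold(first_token_acts, q=0.5):
    """Derive a threshold from the sequence's first activations, which the
    paper argues are informative about which units the sequence will use."""
    return np.quantile(np.abs(first_token_acts), q)

def dynamic_activation(acts, threshold):
    """Zero out low-magnitude activations, exploiting inherent sparsity
    without any retraining."""
    return acts * (np.abs(acts) >= threshold)
```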
arXiv Detail & Related papers (2024-08-21T07:38:51Z)
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
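In outline, cross-model guidance adds a safety direction extracted from an aligned model to the target model's activations, gated by a harmfulness signal; a hypothetical sketch (the gating rule and names are assumptions):

```python
def infer_align_step(h_target, safety_vector, harm_score, alpha=1.0, tau=0.0):
    """Intervene only when the input is judged harmful, leaving benign
    downstream behavior untouched."""
    if harm_score > tau:
        return h_target + alpha * safety_vector
    return h_target
```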
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
- Learning Objective-Specific Active Learning Strategies with Attentive Neural Processes [72.75421975804132]
Learning Active Learning (LAL) proposes to learn the active learning strategy itself, allowing it to adapt to the given setting.
We propose a novel LAL method for classification that exploits symmetry and independence properties of the active learning problem.
Our approach is based on learning from a myopic oracle, which gives our model the ability to adapt to non-standard objectives.
arXiv Detail & Related papers (2023-09-11T14:16:37Z)
- Model-Based Reinforcement Learning with Multi-Task Offline Pretraining [59.82457030180094]
We present a model-based RL method that learns to transfer potentially useful dynamics and action demonstrations from offline data to a novel task.
The main idea is to use the world models not only as simulators for behavior learning but also as tools to measure task relevance.
We demonstrate the advantages of our approach compared with the state-of-the-art methods in Meta-World and DeepMind Control Suite.
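One simple proxy for using a world model to measure task relevance is its one-step prediction error on the new task's transitions (an illustrative scoring rule, not the paper's exact measure):

```python
import numpy as np

def task_relevance(world_model, transitions):
    """Score an offline-pretrained world model by how well it predicts the
    target task's transitions; lower error suggests more transferable dynamics."""
    errs = [np.linalg.norm(world_model(s, a) - s_next)
            for s, a, s_next in transitions]
    return -float(np.mean(errs))
```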
arXiv Detail & Related papers (2023-06-06T02:24:41Z)
- OSCAR: Data-Driven Operational Space Control for Adaptive and Robust Robot Manipulation [50.59541802645156]
Operational Space Control (OSC) has been used as an effective task-space controller for manipulation.
We propose OSC for Adaptation and Robustness (OSCAR), a data-driven variant of OSC that compensates for modeling errors.
We evaluate our method on a variety of simulated manipulation problems, and find substantial improvements over an array of controller baselines.
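A common shape for such data-driven compensation is a nominal controller plus a learned residual; a hypothetical sketch (OSCAR's actual parameterization of the task-space dynamics is richer than this):

```python
def oscar_torque(q, dq, task_accel, nominal_osc, residual_model):
    """Data-driven OSC sketch: the nominal operational-space torque is
    corrected by a residual learned from data to absorb modeling errors."""
    tau = nominal_osc(q, dq, task_accel)   # analytic OSC from a nominal model
    return tau + residual_model(q, dq)     # learned compensation term
```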
arXiv Detail & Related papers (2021-10-02T01:21:38Z)
- GEM: Group Enhanced Model for Learning Dynamical Control Systems [78.56159072162103]
We build effective dynamical models that are amenable to sample-based learning.
We show that learning the dynamics on a Lie algebra vector space is more effective than learning a direct state transition model.
This work sheds light on a connection between learning of dynamics and Lie group properties, which opens doors for new research directions.
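The key idea, predict in the Lie algebra and map to the group, can be illustrated with rotations: a model outputs a 3-vector in so(3), and the next state is obtained via the exponential map rather than by regressing a rotation matrix directly (an illustrative instance, not the paper's full model):

```python
import numpy as np

def hat(w):
    """Map a 3-vector in the Lie algebra so(3) to its skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_so3(w):
    """Exponential map so(3) -> SO(3) via Rodrigues' formula."""
    theta = np.linalg.norm(w)
    if theta < 1e-8:
        return np.eye(3)
    K = hat(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

# A learned dynamics model predicts w = f(state); the next rotation is then
# R_next = exp_so3(w) @ R, so learning happens in the flat algebra while the
# prediction stays exactly on the group.
```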
arXiv Detail & Related papers (2021-04-07T01:08:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.