DEAL: Disentangling Transformer Head Activations for LLM Steering
- URL: http://arxiv.org/abs/2506.08359v1
- Date: Tue, 10 Jun 2025 02:16:50 GMT
- Title: DEAL: Disentangling Transformer Head Activations for LLM Steering
- Authors: Li-Ming Zhan, Bo Liu, Zexin Lu, Chengqiang Xie, Jiannong Cao, Xiao-Ming Wu
- Abstract summary: We propose a principled causal-attribution framework for identifying behavior-relevant attention heads in transformers. For each head, we train a vector-quantized autoencoder (VQ-AE) on its attention activations. We assess the behavioral relevance of each head by the separability of VQ-AE encodings for behavior-aligned versus behavior-violating responses.
- Score: 19.770342907146965
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inference-time steering aims to alter the response characteristics of large language models (LLMs) without modifying their underlying parameters. A critical step in this process is the identification of internal modules within LLMs that are associated with the target behavior. However, current approaches to module selection often depend on superficial cues or ad-hoc heuristics, which can result in suboptimal or unintended outcomes. In this work, we propose a principled causal-attribution framework for identifying behavior-relevant attention heads in transformers. For each head, we train a vector-quantized autoencoder (VQ-AE) on its attention activations, partitioning the latent space into behavior-relevant and behavior-irrelevant subspaces, each quantized with a shared learnable codebook. We assess the behavioral relevance of each head by quantifying the separability of VQ-AE encodings for behavior-aligned versus behavior-violating responses using a binary classification metric. This yields a behavioral relevance score that reflects each head's discriminative capacity with respect to the target behavior, guiding both selection and importance weighting. Experiments on seven LLMs from two model families and five behavioral steering datasets demonstrate that our method enables more accurate inference-time interventions, achieving superior performance on the truthfulness-steering task. Furthermore, the heads selected by our approach exhibit strong zero-shot generalization in cross-domain truthfulness-steering scenarios.
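As a rough illustration of the head-scoring recipe in the abstract, the sketch below trains a small vector-quantized autoencoder on one head's activations, splits the latent into a behavior-relevant and a behavior-irrelevant half that share one learnable codebook, and scores the head by the separability of the relevant encodings for behavior-aligned versus behavior-violating responses. Module names, dimensions, and the choice of AUROC as the binary classification metric are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score


class HeadVQAE(nn.Module):
    """Toy VQ-AE for one attention head: latent split into a behavior-relevant and a
    behavior-irrelevant half, both quantized against one shared learnable codebook."""

    def __init__(self, act_dim: int, latent_dim: int = 32, codebook_size: int = 64):
        super().__init__()
        assert latent_dim % 2 == 0
        self.encoder = nn.Linear(act_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, act_dim)
        self.codebook = nn.Embedding(codebook_size, latent_dim // 2)  # shared by both halves

    def quantize(self, z_half):
        # Nearest codebook entry per row, with a straight-through estimator for gradients.
        idx = torch.cdist(z_half, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(idx)
        return z_half + (z_q - z_half).detach(), z_q

    def forward(self, x):
        z_rel, z_irr = self.encoder(x).chunk(2, dim=-1)   # relevant / irrelevant halves
        q_rel, cb_rel = self.quantize(z_rel)
        q_irr, cb_irr = self.quantize(z_irr)
        recon = self.decoder(torch.cat([q_rel, q_irr], dim=-1))
        loss = (F.mse_loss(recon, x)                                                          # reconstruction
                + F.mse_loss(cb_rel, z_rel.detach()) + F.mse_loss(cb_irr, z_irr.detach())     # codebook
                + 0.25 * (F.mse_loss(z_rel, cb_rel.detach()) + F.mse_loss(z_irr, cb_irr.detach())))  # commitment
        return loss, q_rel


def head_relevance_score(acts, labels, epochs: int = 200) -> float:
    """acts: (N, act_dim) activations of one head; labels: (N,) with 1 = behavior-aligned,
    0 = behavior-violating. Returns an AUROC-style separability score for the head."""
    model = HeadVQAE(acts.shape[-1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        loss, _ = model(acts)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        _, q_rel = model(acts)
        mu1 = q_rel[labels == 1].mean(dim=0)    # class means of the relevant encodings
        mu0 = q_rel[labels == 0].mean(dim=0)
        scores = q_rel @ (mu1 - mu0)            # 1-D projection onto the mean-difference axis
    return float(roc_auc_score(labels.numpy(), scores.numpy()))
```

Under this reading, heads whose score sits well above 0.5 would be the candidates for intervention and for importance weighting during steering.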
Related papers
- Stochastic Encodings for Active Feature Acquisition [100.47043816019888]
Active Feature Acquisition is an instance-wise, sequential decision-making problem. The aim is to dynamically select which feature to measure based on current observations, independently for each test instance. Common approaches either use Reinforcement Learning, which experiences training difficulties, or greedily maximize the conditional mutual information of the label and unobserved features, which is myopic. We introduce a latent variable model, trained in a supervised manner. Acquisitions are made by reasoning about the features across many possible unobserved realizations in a latent space.
arXiv Detail & Related papers (2025-08-03T23:48:46Z) - GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
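A minimal sketch of the kind of mechanism this summary describes: shift a hidden state along a direction derived from token-level gradient attribution, then rescale each token vector back to its original norm. The specific attribution weighting and the `alpha` strength are illustrative assumptions, not the GrAInS release.

```python
import torch

def steer_hidden(hidden: torch.Tensor, grad: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """hidden, grad: (seq_len, d_model). `grad` is d(loss)/d(hidden) from a contrastive
    preference loss; the per-token attribution here is gradient x activation (assumed)."""
    attribution = (grad * hidden).sum(dim=-1, keepdim=True)           # (seq_len, 1) token scores
    direction = -grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)      # move against the loss gradient
    steered = hidden + alpha * torch.sigmoid(attribution) * direction
    # Rescale each token vector to its original norm so representational scale is preserved.
    scale = hidden.norm(dim=-1, keepdim=True) / (steered.norm(dim=-1, keepdim=True) + 1e-8)
    return steered * scale
```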
arXiv Detail & Related papers (2025-07-24T02:34:13Z) - CKAA: Cross-subspace Knowledge Alignment and Aggregation for Robust Continual Learning [80.18781219542016]
Continual Learning (CL) empowers AI models to continuously learn from sequential task streams. Recent parameter-efficient fine-tuning (PEFT)-based CL methods have garnered increasing attention due to their superior performance. We propose Cross-subspace Knowledge Alignment and Aggregation (CKAA) to enhance robustness against misleading task-ids.
arXiv Detail & Related papers (2025-07-13T03:11:35Z) - Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers [3.7132788234059104]
We present causal head gating (CHG), a scalable method for interpreting the functional roles of attention heads in transformer models. CHG learns soft gates over heads and assigns them a causal taxonomy based on their impact on task performance. We show that CHG scores yield causal - not merely correlational - insight, validated via ablation and causal mediation analyses.
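A minimal sketch, under assumed shapes and names, of what "soft gates over heads" can look like in practice: one sigmoid gate per attention head, multiplied into that head's output and trained against the task loss with a sparsity penalty so that only heads the task causally depends on stay open.

```python
import torch
import torch.nn as nn

class GatedHeads(nn.Module):
    """Soft per-head gates: each head's output is scaled by a learnable value in [0, 1]."""

    def __init__(self, n_heads: int):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(n_heads))        # sigmoid(0) = 0.5 to start

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:   # (batch, n_heads, seq, d_head)
        gates = torch.sigmoid(self.gate_logits)
        return head_outputs * gates.view(1, -1, 1, 1)

    def sparsity_penalty(self) -> torch.Tensor:
        # Added to the task loss so gates close unless a head genuinely helps performance.
        return torch.sigmoid(self.gate_logits).mean()
```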
arXiv Detail & Related papers (2025-05-19T21:24:13Z) - ExpertSteer: Intervening in LLMs through Expert Knowledge [71.12193680015622]
Activation steering offers a promising method to control the generation process of Large Language Models. We propose ExpertSteer, a novel approach that leverages arbitrary specialized expert models to generate steering vectors. We conduct comprehensive experiments using three LLMs on 15 popular benchmarks across four distinct domains.
arXiv Detail & Related papers (2025-05-18T08:55:46Z) - Behaviour Discovery and Attribution for Explainable Reinforcement Learning [6.123880364445758]
Building trust in reinforcement learning (RL) agents requires understanding why they make certain decisions. Existing explainability methods often focus on single states or entire trajectories. We propose a fully offline, reward-free framework for behavior discovery and segmentation.
arXiv Detail & Related papers (2025-03-19T08:06:00Z) - Multi-Attribute Steering of Language Models via Targeted Intervention [56.93583799109029]
Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction. We introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes.
arXiv Detail & Related papers (2025-02-18T02:27:23Z) - Designing Role Vectors to Improve LLM Inference Behaviour [8.995812770349605]
The influence of personas on Large Language Models (LLMs) has been widely studied, yet their direct impact on performance remains uncertain. This work explores a novel approach to guiding LLM behaviour through role vectors, an alternative to persona-based prompting.
arXiv Detail & Related papers (2025-02-17T17:24:37Z) - Focus On This, Not That! Steering LLMs with Adaptive Feature Specification [48.27684487597968]
Focus Instruction Tuning (FIT) trains large language models to condition their responses by focusing on specific features whilst ignoring others, leading to different behaviours based on what features are specified. We demonstrate that FIT (i) successfully steers behaviour at inference time; (ii) increases robustness by amplifying core task signals and down-weighting spurious cues; (iii) mitigates social bias by suppressing demographic attributes; and (iv) generalises under distribution shifts and to previously unseen focus features.
arXiv Detail & Related papers (2024-10-30T12:01:48Z) - Causality-Aware Transformer Networks for Robotic Navigation [13.719643934968367]
Current research in Visual Navigation reveals opportunities for improvement.
Direct adoption of RNNs and Transformers often overlooks the specific differences between Embodied AI and traditional sequential data modelling.
We propose Causality-Aware Transformer (CAT) Networks for Navigation, featuring a Causal Understanding Module.
arXiv Detail & Related papers (2024-09-04T12:53:26Z) - LoFiT: Localized Fine-tuning on LLM Representations [60.99814930367597]
We introduce a framework called Localized Fine-Tuning on LLM Representations (LoFiT).
LoFiT identifies a subset of attention heads that are most important for learning a specific task, then trains offset vectors to add to the model's hidden representations at those selected heads.
For truthfulness and reasoning tasks, we find that LoFiT's intervention vectors are more effective for LLM adaptation than vectors from representation intervention methods such as Inference-time Intervention.
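The LoFiT-style intervention described here can be pictured as a small set of trainable offset vectors added to the hidden representation at the selected heads while the base model stays frozen; the sketch below is an assumed, simplified rendering, not the released code.

```python
import torch
import torch.nn as nn

class HeadOffsets(nn.Module):
    """Trainable per-head offset vectors added to the hidden representation at selected heads;
    all original model weights stay frozen, so only the offsets receive gradients."""

    def __init__(self, selected_heads, d_head: int):
        super().__init__()
        # selected_heads: iterable of (layer_idx, head_idx) pairs chosen by the importance step.
        self.offsets = nn.ParameterDict({
            f"{layer}_{head}": nn.Parameter(torch.zeros(d_head))
            for layer, head in selected_heads
        })

    def apply(self, layer: int, head: int, head_hidden: torch.Tensor) -> torch.Tensor:
        """head_hidden: (batch, seq, d_head) output of one attention head, e.g. from a forward hook."""
        key = f"{layer}_{head}"
        return head_hidden + self.offsets[key] if key in self.offsets else head_hidden
```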
arXiv Detail & Related papers (2024-06-03T17:45:41Z) - Value function interference and greedy action selection in value-based multi-objective reinforcement learning [1.4206639868377509]
Multi-objective reinforcement learning (MORL) algorithms extend conventional reinforcement learning (RL) to settings where the agent receives vector-valued rewards over multiple objectives.
We show that, if the user's utility function maps widely varying vector-values to similar levels of utility, this can lead to interference.
We demonstrate empirically that avoiding the use of random tie-breaking when identifying greedy actions can ameliorate, but not fully overcome, the problems caused by value function interference.
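A constructed numeric example (not taken from the paper) of the interference effect: two actions yield very different vector returns with identical utility, and random tie-breaking mixes their value vectors into an estimate whose utility no actual policy achieves.

```python
import numpy as np

def utility(v):
    return float(min(v))                 # user utility: worst objective ("fairness"-style)

v_a = np.array([4.0, 0.0])               # vector return of always taking action a
v_b = np.array([0.0, 4.0])               # vector return of always taking action b
print(utility(v_a), utility(v_b))        # 0.0 0.0 -> the two actions tie under this utility

# Random tie-breaking alternates between a and b, so the value vector learned for the
# preceding state drifts toward the average of the two returns:
mixed = 0.5 * (v_a + v_b)
print(mixed, utility(mixed))             # [2. 2.] 2.0 -> utility 2, which no actual policy achieves
# Consistent tie-breaking keeps the estimate pinned to one true return and avoids the inflation.
```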
arXiv Detail & Related papers (2024-02-09T09:28:01Z) - Interactive Autonomous Navigation with Internal State Inference and Interactivity Estimation [58.21683603243387]
We propose three auxiliary tasks with relational-temporal reasoning and integrate them into the standard Deep Learning framework.
These auxiliary tasks provide additional supervision signals to infer the behavior patterns of other interactive agents.
Our approach achieves robust and state-of-the-art performance in terms of standard evaluation metrics.
arXiv Detail & Related papers (2023-11-27T18:57:42Z) - Intuitive or Dependent? Investigating LLMs' Behavior Style to Conflicting Prompts [9.399159332152013]
This study investigates the behaviors of Large Language Models (LLMs) when faced with conflicting prompts versus their internal memory.
This will help to understand LLMs' decision mechanisms and also benefit real-world applications, such as retrieval-augmented generation (RAG).
arXiv Detail & Related papers (2023-09-29T17:26:03Z)