Related papers: REAL: Reading Out Transformer Activations for Precise Localization in Language Model Steering

URL: http://arxiv.org/abs/2506.08359v2
Date: Wed, 01 Oct 2025 16:56:05 GMT
Title: REAL: Reading Out Transformer Activations for Precise Localization in Language Model Steering
Authors: Li-Ming Zhan, Bo Liu, Chengqiang Xie, Jiannong Cao, Xiao-Ming Wu
Abstract summary: Inference-time steering aims to alter a large language model's responses without changing its parameters. Existing approaches often rely on simplistic cues or ad hoc generalizations. We introduce REAL, a framework for identifying behavior-relevant modules in Transformer models.
Score: 26.428347164111926
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Inference-time steering aims to alter a large language model's (LLM's) responses without changing its parameters, but a central challenge is identifying the internal modules that most strongly govern the target behavior. Existing approaches often rely on simplistic cues or ad hoc heuristics, leading to suboptimal or unintended effects. We introduce REAL, a framework for identifying behavior-relevant modules (attention heads or layers) in Transformer models. For each module, REAL trains a vector-quantized autoencoder (VQ-AE) on its hidden activations and uses a shared, learnable codebook to partition the latent space into behavior-relevant and behavior-irrelevant subspaces. REAL quantifies a module's behavioral relevance by how well its VQ-AE encodings discriminate behavior-aligned from behavior-violating responses via a binary classification metric; this score guides both module selection and steering strength. We evaluate REAL across eight LLMs from the Llama and Qwen families and nine datasets spanning truthfulness enhancement, open-domain QA under knowledge conflicts, and general alignment tasks. REAL enables more effective inference-time interventions, achieving an average relative improvement of 20% (up to 81.5%) over the ITI method on truthfulness steering. In addition, the modules selected by REAL exhibit strong zero-shot generalization in cross-domain truthfulness-steering scenarios.
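The scoring idea in the abstract (quantize a module's activations, then check how well the codes separate behavior-aligned from behavior-violating responses) can be illustrated with a rough sketch. This is not the paper's implementation: plain k-means stands in for the learned VQ-AE codebook, the activations are synthetic, and the names `fit_codebook`, `quantize`, `relevance_score`, `head_relevant`, and `head_irrelevant` are hypothetical.

```python
import numpy as np


def quantize(x, codebook):
    """Return the index of the nearest codebook vector for each row of x."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def fit_codebook(x, k, iters=20, seed=0):
    """Fit k code vectors with plain k-means (assumed stand-in for a learned VQ codebook)."""
    rng = np.random.default_rng(seed)
    codebook = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        codes = quantize(x, codebook)
        for j in range(k):
            members = x[codes == j]
            if len(members) > 0:  # leave empty clusters where they are
                codebook[j] = members.mean(axis=0)
    return codebook


def relevance_score(acts, labels, k=4):
    """Accuracy of predicting the 0/1 label from the code index alone.

    Each code predicts the majority label of its members; accuracy is
    measured on the same data used to fit the codebook.
    """
    codes = quantize(acts, fit_codebook(acts, k))
    correct = 0
    for j in range(k):
        member_labels = labels[codes == j]
        ones = int(member_labels.sum())
        correct += max(ones, len(member_labels) - ones)
    return correct / len(labels)


# Synthetic activations: label 0 = behavior-violating, label 1 = behavior-aligned.
rng = np.random.default_rng(0)
n, dim = 100, 8
labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

modules = {
    # Activations differ strongly between the two response types.
    "head_relevant": np.concatenate(
        [rng.normal(-3.0, 1.0, (n, dim)), rng.normal(3.0, 1.0, (n, dim))]
    ),
    # Activations carry no information about the response type.
    "head_irrelevant": rng.normal(0.0, 1.0, (2 * n, dim)),
}

scores = {name: relevance_score(acts, labels) for name, acts in modules.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

In this toy setting the module whose activations separate the two response types ranks first. A score of this kind could then be used to choose which modules to intervene on, as the abstract describes; how REAL maps the score to steering strength is not specified here.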

Related papers

Model Specific Task Similarity for Vision Language Model Selection via Layer Conductance [92.72779885657373]
We propose a framework that grounds model selection in the internal functional dynamics of the visual encoder. Our approach represents each task via layer-wise conductance and derives a target-conditioned block importance distribution through entropy-regularized alignment. Building on this, we introduce Directional Conductance Divergence (DCD), an asymmetric metric that quantifies how effectively a source task covers the target's salient functional blocks.
arXiv Detail & Related papers (2026-02-01T17:29:43Z)
Source-Free Object Detection with Detection Transformer [59.33653163035064]
Source-Free Object Detection (SFOD) enables knowledge transfer from a source domain to an unsupervised target domain for object detection without access to source data. Most existing SFOD approaches are either confined to conventional object detection (OD) models like Faster R-CNN or designed as general solutions without tailored adaptations for novel OD architectures, especially Detection Transformer (DETR). In this paper, we introduce Feature Reweighting ANd Contrastive Learning NetworK (FRANCK), a novel SFOD framework specifically designed to perform query-centric feature enhancement for DETRs.
arXiv Detail & Related papers (2025-10-13T07:35:04Z)
Attribution-Guided Decoding [24.52258081219335]
We introduce Attribution-Guided Decoding (AGD), an interpretability-based decoding strategy. Instead of directly manipulating model activations, AGD considers a set of high-probability output token candidates. We demonstrate AGD's efficacy across three challenging domains.
arXiv Detail & Related papers (2025-09-30T14:21:40Z)
Stochastic Encodings for Active Feature Acquisition [100.47043816019888]
Active Feature Acquisition is an instance-wise, sequential decision making problem. The aim is to dynamically select which feature to measure based on current observations, independently for each test instance. Common approaches either use Reinforcement Learning, which experiences training difficulties, or greedily maximize the conditional mutual information of the label and unobserved features, which makes myopic acquisitions. We introduce a latent variable model, trained in a supervised manner. Acquisitions are made by reasoning about the features across many possible unobserved realizations in a latent space.
arXiv Detail & Related papers (2025-08-03T23:48:46Z)
GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
CKAA: Cross-subspace Knowledge Alignment and Aggregation for Robust Continual Learning [80.18781219542016]
Continual Learning (CL) empowers AI models to continuously learn from sequential task streams. Recent parameter-efficient fine-tuning (PEFT)-based CL methods have garnered increasing attention due to their superior performance. We propose Cross-subspace Knowledge Alignment and Aggregation (CKAA) to enhance robustness against misleading task-ids.
arXiv Detail & Related papers (2025-07-13T03:11:35Z)
Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers [3.7132788234059104]
We present causal head gating (CHG), a scalable method for interpreting the functional roles of attention heads in transformer models. CHG learns soft gates over heads and assigns them a causal taxonomy based on their impact on task performance. We show that CHG scores yield causal - not merely correlational - insight, validated via ablation and causal mediation analyses.
arXiv Detail & Related papers (2025-05-19T21:24:13Z)
ExpertSteer: Intervening in LLMs through Expert Knowledge [71.12193680015622]
Activation steering offers a promising method to control the generation process of Large Language Models. We propose ExpertSteer, a novel approach that leverages arbitrary specialized expert models to generate steering vectors. We conduct comprehensive experiments using three LLMs on 15 popular benchmarks across four distinct domains.
arXiv Detail & Related papers (2025-05-18T08:55:46Z)
Behaviour Discovery and Attribution for Explainable Reinforcement Learning [6.123880364445758]
Building trust in reinforcement learning (RL) agents requires understanding why they make certain decisions. Existing explainability methods often focus on single states or entire trajectories. We propose a fully offline, reward-free framework for behavior discovery and segmentation.
arXiv Detail & Related papers (2025-03-19T08:06:00Z)
Steering Large Language Model Activations in Sparse Spaces [21.55545768931058]
A key challenge in AI alignment is guiding large language models (LLMs) to follow desired behaviors at test time. We introduce sparse activation steering (SAS), a method that leverages sparse autoencoders (SAEs) to steer behavior in sparse spaces.
arXiv Detail & Related papers (2025-02-28T20:43:45Z)
Multi-Attribute Steering of Language Models via Targeted Intervention [56.93583799109029]
Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction. We introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes.
arXiv Detail & Related papers (2025-02-18T02:27:23Z)
Designing Role Vectors to Improve LLM Inference Behaviour [8.995812770349605]
The influence of personas on Large Language Models (LLMs) has been widely studied, yet their direct impact on performance remains uncertain. This work explores a novel approach to guiding LLM behaviour through role vectors, an alternative to persona-based prompting.
arXiv Detail & Related papers (2025-02-17T17:24:37Z)
Focus On This, Not That! Steering LLMs with Adaptive Feature Specification [48.27684487597968]
Focus Instruction Tuning (FIT) trains large language models to condition their responses by focusing on specific features whilst ignoring others, leading to different behaviours based on what features are specified. We demonstrate that FIT: (i) successfully steers behaviour at inference time; (ii) increases robustness by amplifying core task signals and down-weighting spurious cues; (iii) mitigates social bias by suppressing demographic attributes; and (iv) generalises under distribution shifts and to previously unseen focus features.
arXiv Detail & Related papers (2024-10-30T12:01:48Z)
Causality-Aware Transformer Networks for Robotic Navigation [13.719643934968367]
Current research in Visual Navigation reveals opportunities for improvement. Direct adoption of RNNs and Transformers often overlooks the specific differences between Embodied AI and traditional sequential data modelling. We propose Causality-Aware Transformer (CAT) Networks for Navigation, featuring a Causal Understanding Module.
arXiv Detail & Related papers (2024-09-04T12:53:26Z)
LoFiT: Localized Fine-tuning on LLM Representations [60.99814930367597]
We introduce a framework called Localized Fine-Tuning on LLM Representations (LoFiT). LoFiT identifies a subset of attention heads that are most important for learning a specific task, then trains offset vectors to add to the model's hidden representations at those selected heads. For truthfulness and reasoning tasks, we find that LoFiT's intervention vectors are more effective for LLM adaptation than vectors from representation intervention methods such as Inference-time Intervention.
arXiv Detail & Related papers (2024-06-03T17:45:41Z)
Value function interference and greedy action selection in value-based multi-objective reinforcement learning [1.4206639868377509]
Multi-objective reinforcement learning (MORL) algorithms extend conventional reinforcement learning (RL). We show that, if the user's utility function maps widely varying vector-values to similar levels of utility, this can lead to interference. We demonstrate empirically that avoiding the use of random tie-breaking when identifying greedy actions can ameliorate, but not fully overcome, the problems caused by value function interference.
arXiv Detail & Related papers (2024-02-09T09:28:01Z)
Adaptive Global-Local Representation Learning and Selection for Cross-Domain Facial Expression Recognition [54.334773598942775]
Domain shift poses a significant challenge in Cross-Domain Facial Expression Recognition (CD-FER). We propose an Adaptive Global-Local Representation Learning and Selection framework.
arXiv Detail & Related papers (2024-01-20T02:21:41Z)
Interactive Autonomous Navigation with Internal State Inference and Interactivity Estimation [58.21683603243387]
We propose three auxiliary tasks with relational-temporal reasoning and integrate them into the standard Deep Learning framework. These auxiliary tasks provide additional supervision signals to infer the behavior patterns of other interactive agents. Our approach achieves robust and state-of-the-art performance in terms of standard evaluation metrics.
arXiv Detail & Related papers (2023-11-27T18:57:42Z)
Intuitive or Dependent? Investigating LLMs' Behavior Style to Conflicting Prompts [9.399159332152013]
This study investigates the behaviors of Large Language Models (LLMs) when faced with conflicting prompts versus their internal memory. This will help to understand LLMs' decision mechanism and also benefit real-world applications, such as retrieval-augmented generation (RAG).
arXiv Detail & Related papers (2023-09-29T17:26:03Z)
USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality. Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.