DISCO: Disentangled Communication Steering for Large Language Models
- URL: http://arxiv.org/abs/2509.16820v1
- Date: Sat, 20 Sep 2025 21:56:03 GMT
- Title: DISCO: Disentangled Communication Steering for Large Language Models
- Authors: Max Torop, Aria Masoomi, Masih Eskandar, Jennifer Dy
- Abstract summary: We propose to inject steering vectors directly into the query and value representation spaces within attention heads. We analytically characterize the effect of our method, which we term DISentangled COmmunication (DISCO) Steering, on attention head outputs.
- Score: 3.4065590965511436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A variety of recent methods guide large language model outputs via the inference-time addition of steering vectors to residual-stream or attention-head representations. In contrast, we propose to inject steering vectors directly into the query and value representation spaces within attention heads. We provide evidence that a greater portion of these spaces exhibits high linear discriminability of concepts (a key property motivating the use of steering vectors) than attention head outputs. We analytically characterize the effect of our method, which we term DISentangled COmmunication (DISCO) Steering, on attention head outputs. Our analysis reveals that DISCO disentangles a strong but underutilized baseline, steering attention inputs, which implicitly modifies queries and values in a rigid manner. In contrast, DISCO's direct modulation of these components enables more granular control. We find that DISCO achieves superior performance over a number of steering vector baselines across multiple datasets on LLaMA 3.1 8B and Gemma 2 9B, with steering efficacy scoring up to 19.1% higher than the runner-up. Our results support the conclusion that the query and value spaces are powerful building blocks for steering vector methods.
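The contrast drawn in the abstract between steering attention inputs and steering queries and values directly can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a single causal attention head with bias-free linear projections, no normalization layer, and no positional encoding, and all names (`steer_qv`, `steer_input`, `s_q`, `s_v`, `alpha_q`, `alpha_v`) are hypothetical. Under those assumptions, adding `alpha * s` to the attention input behaves like query/value steering in which both directions and both strengths are tied together, whereas direct steering lets them be set independently.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_output(q, k, v):
    """Causal scaled dot-product attention for one head. q, k, v: (T, d)."""
    T, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # positions j > i
    logits = np.where(future, -np.inf, logits)
    return softmax(logits) @ v

def steer_qv(x, W_q, W_k, W_v, s_q, s_v, alpha_q, alpha_v):
    """Hypothetical direct steering: separate vectors and strengths
    are added in query space and in value space; keys are untouched."""
    q = x @ W_q + alpha_q * s_q
    k = x @ W_k
    v = x @ W_v + alpha_v * s_v
    return head_output(q, k, v)

def steer_input(x, W_q, W_k, W_v, s, alpha):
    """Baseline: one vector added to the attention input, so the query,
    key, and value shifts all share the same direction s and strength."""
    x_s = x + alpha * s
    return head_output(x_s @ W_q, x_s @ W_k, x_s @ W_v)

# Toy dimensions and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
T, d_model, d_head = 5, 16, 8
x = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
s = rng.normal(size=d_model)

# Input steering matches query/value steering with tied settings:
# the shared key shift adds the same constant to every logit in a row,
# which the softmax ignores.
tied = steer_qv(x, W_q, W_k, W_v, s @ W_q, s @ W_v, 2.0, 2.0)
via_input = steer_input(x, W_q, W_k, W_v, s, 2.0)
print("max abs difference:", np.abs(tied - via_input).max())
```

In this toy setting, steering only the values (`alpha_q = 0`) shifts every head output by exactly `alpha_v * s_v`, because attention weights sum to one, while steering only the queries changes which positions are attended to. Real models add normalization and positional encodings before or inside attention, so the equivalence shown here should be read as intuition rather than as a statement about LLaMA or Gemma.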
Related papers
- Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations [0.0]
I investigate why steering reliability differs across behaviors and how it is impacted by steering vector training data.
I find that higher cosine similarity between training activation differences predicts more reliable steering.
I observe that behavior datasets where positive and negative activations are better separated along the steering direction are more reliably steerable.
arXiv Detail & Related papers (2026-02-19T22:37:05Z)
- Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics [81.80010043113445]
Local weight fine-tuning, LoRA-based adaptation, and activation-based interventions are studied in isolation.
We present a unified view that frames these interventions as dynamic weight updates induced by a control signal.
Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility.
arXiv Detail & Related papers (2026-02-02T17:04:36Z)
- Mechanistic Indicators of Steering Effectiveness in Large Language Models [3.635648354808971]
Activation-based steering enables Large Language Models to exhibit targeted behaviors by intervening on intermediate activations without retraining.
Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood.
We investigate whether the reliability of steering can be diagnosed using internal model signals.
arXiv Detail & Related papers (2026-02-02T06:56:22Z)
- Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement [31.282134977964976]
Existing steering methods rely on large-scale datasets to learn clear behavioral information.
We introduce Refinement of Steering Vector via Sparse Autoencoder (SAE-RSV) that leverages SAEs to semantically denoise and augment the steering vectors.
In our framework, we first remove task-irrelevant features according to their semantics provided by SAEs, and then enrich task-relevant features missing from the small dataset through their semantic similarity to the identified relevant features.
arXiv Detail & Related papers (2025-09-28T10:49:22Z)
- KV Cache Steering for Controlling Frozen LLMs [80.50365534625438]
Cache steering is a lightweight method for implicit steering of language models.
We apply cache steering to induce chain-of-thought reasoning in small language models.
arXiv Detail & Related papers (2025-07-11T17:59:36Z)
- SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models [41.553639748766784]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.
This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces.
arXiv Detail & Related papers (2025-05-22T03:46:57Z)
- Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering [41.588589098740755]
Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness.
We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keep the most discriminative SAE latents while reconstructing hidden representations.
arXiv Detail & Related papers (2025-05-21T02:45:11Z)
- SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.190800043449336]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism.
Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance.
We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z)
- Analyzing the Generalization and Reliability of Steering Vectors [8.253773195379166]
We show that steering vectors have substantial limitations both in- and out-of-distribution.
In-distribution, steerability is highly variable across different inputs.
Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt.
arXiv Detail & Related papers (2024-07-17T08:32:03Z)
- Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [34.05163996072159]
"steering vectors" are extracted from the activations of human preference data.
This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization.
Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs.
arXiv Detail & Related papers (2024-05-28T05:10:40Z)
- Benchmarking the Robustness of LiDAR Semantic Segmentation Models [78.6597530416523]
In this paper, we aim to comprehensively analyze the robustness of LiDAR semantic segmentation models under various corruptions.
We propose a new benchmark called SemanticKITTI-C, which features 16 out-of-domain LiDAR corruptions in three groups, namely adverse weather, measurement noise and cross-device discrepancy.
We design a robust LiDAR segmentation model (RLSeg) which greatly boosts the robustness with simple but effective modifications.
arXiv Detail & Related papers (2023-01-03T06:47:31Z)
- Detecting Rotated Objects as Gaussian Distributions and Its 3-D Generalization [81.29406957201458]
Existing detection methods commonly use a parameterized bounding box (BBox) to model and detect (horizontal) objects.
We argue that such a mechanism has fundamental limitations in building an effective regression loss for rotation detection.
We propose to model the rotated objects as Gaussian distributions.
We extend our approach from 2-D to 3-D with a tailored algorithm design to handle the heading estimation.
arXiv Detail & Related papers (2022-09-22T07:50:48Z)
- Anchor-free Oriented Proposal Generator for Object Detection [59.54125119453818]
Oriented object detection is a practical and challenging task in remote sensing image interpretation.
Oriented detectors today mostly use horizontal boxes as an intermediate step from which to derive oriented boxes.
We propose a novel Anchor-free Oriented Proposal Generator (AOPG) that removes horizontal-box-related operations from the network architecture.
arXiv Detail & Related papers (2021-10-05T10:45:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.