Fusion Steering: Prompt-Specific Activation Control
- URL: http://arxiv.org/abs/2505.22572v1
- Date: Wed, 28 May 2025 16:46:55 GMT
- Title: Fusion Steering: Prompt-Specific Activation Control
- Authors: Waldemar Chang, Alhassan Yasin
- Abstract summary: Fusion Steering improves factual accuracy in large language models (LLMs) for question-answering (QA) tasks. This approach introduces flexible steering configurations, including full-layer steering and segmented steering. Under the stricter SimpleQA rubric, segmented steering boosts fully correct responses from 0.0% to 13.1%.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Fusion Steering, an activation steering methodology that improves factual accuracy in large language models (LLMs) for question-answering (QA) tasks. This approach introduces flexible steering configurations, including full-layer steering and segmented steering. Unlike traditional methods constrained to single-layer or fixed-layer operations, Fusion Steering employs dynamic injection of prompt-specific activation deltas across all transformer layers. These activation deltas are derived from reference completions that combine the ground-truth answer with a model-generated explanation to facilitate semantically enriched, example-specific steering. The injection weights are optimized per prompt using Optuna, targeting a joint objective that balances token overlap (factual alignment) and perplexity (fluency proxy). Evaluation employs a composite score integrating token overlap and LLM-graded quality, encompassing factual accuracy, coherence, and relevance. Empirical results on 260 SimpleQA prompts (selected from 500 where the baseline failed) showcase the efficacy of segmented steering. Using Gemma-2-2B-IT with 8-bit quantization, segmented steering achieves an accuracy of 25.4% (outputs scoring $\geq 0.6$), outperforming the baseline at 3.5% and full-layer steering at 16.2%. Under the stricter SimpleQA rubric, segmented steering boosts fully correct responses from 0.0% to 13.1%. These findings highlight the strengths of segmented, dynamic intervention strategies and the promise of per-prompt, full-network activation control. Fusion Steering is also amenable to sparse representations, such as Neuronpedia or sparse crosscoders, suggesting a promising direction for interpretable and scalable activation-level control in LLMs.
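The core mechanism in the abstract can be illustrated in plain Python: prompt-specific activation deltas are injected at every transformer layer with tuned weights, outputs are scored by a token-overlap/perplexity trade-off, and accuracy counts outputs scoring at least 0.6. This is a minimal sketch under stated assumptions, not the authors' implementation. Activations are plain lists rather than model tensors. A "segment" is assumed to be a contiguous block of layers sharing one weight, so a single segment stands in for full-layer steering. The names `activation_deltas`, `expand_segment_weights`, `inject`, `token_overlap`, `joint_objective`, `accuracy`, and the trade-off weight `lam` are all hypothetical. The paper tunes the weights per prompt with Optuna; here they are fixed by hand.

```python
import math
from typing import List, Sequence

Vector = List[float]


def activation_deltas(ref_acts: Sequence[Vector], prompt_acts: Sequence[Vector]) -> List[Vector]:
    # Per-layer delta: activation under the reference completion
    # (ground-truth answer + explanation) minus activation under the prompt alone.
    return [[r - p for r, p in zip(ref, prm)] for ref, prm in zip(ref_acts, prompt_acts)]


def expand_segment_weights(num_layers: int, seg_weights: Sequence[float]) -> List[float]:
    # Assumption: layers are split into len(seg_weights) contiguous segments and
    # every layer in a segment shares that segment's weight.
    n_seg = len(seg_weights)
    return [seg_weights[layer * n_seg // num_layers] for layer in range(num_layers)]


def inject(hidden: Sequence[Vector], deltas: Sequence[Vector], weights: Sequence[float]) -> List[Vector]:
    # Steered hidden state at layer l: h_l + w_l * delta_l, applied at every layer.
    return [[h + w * d for h, d in zip(hl, dl)] for hl, dl, w in zip(hidden, deltas, weights)]


def token_overlap(output: str, reference: str) -> float:
    # Hypothetical overlap metric: share of lowercased reference tokens present in the output.
    ref = reference.lower().split()
    out = set(output.lower().split())
    return sum(tok in out for tok in ref) / len(ref) if ref else 0.0


def joint_objective(overlap: float, perplexity: float, lam: float = 0.1) -> float:
    # Hypothetical trade-off: reward factual overlap, penalize log-perplexity (fluency proxy).
    return overlap - lam * math.log(perplexity)


def accuracy(scores: Sequence[float], threshold: float = 0.6) -> float:
    # Share of outputs whose composite score is >= threshold (0.6 in the abstract).
    return sum(s >= threshold for s in scores) / len(scores)


if __name__ == "__main__":
    ref_acts = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0], [0.0, 4.0]]
    prompt_acts = [[0.0, 1.0], [0.5, 0.0], [1.0, 1.0], [0.0, 2.0]]
    deltas = activation_deltas(ref_acts, prompt_acts)
    weights = expand_segment_weights(4, [0.5, 2.0])  # two segments of two layers each
    print(weights)  # [0.5, 0.5, 2.0, 2.0]
    print(inject(prompt_acts, deltas, weights))  # [[0.5, 1.5], [0.5, 0.25], [5.0, 1.0], [0.0, 6.0]]
    print(token_overlap("The capital is Paris", "Paris"))  # 1.0
```

In a real model the injection would run inside per-layer forward hooks. A per-prompt search would then maximize `joint_objective` on the steered generation, for example an Optuna study whose objective samples one weight per segment with `trial.suggest_float`.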
Related papers
- Weight Updates as Activation Shifts: A Principled Framework for Steering [54.70188910511715]
Activation steering promises to be an extremely parameter-efficient form of adaptation, but its effectiveness depends on critical design choices. We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior. This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically-backed and highly expressive intervention site.
(2026-02-28T02:50:04Z)
- ROAST: Rollout-based On-distribution Activation Steering Technique [16.632201561391366]
Activation steering provides parameter-efficient control over large language models at inference time. We propose ROAST (Rollout-based On-distribution Activation Steering Technique), which estimates steering directions from the model's own on-distribution rollouts via ROC. Our empirical analysis reveals that while activation magnitude correlates moderately with directional consistency, the variance in magnitude is significant and often disproportionate to semantic quality.
(2026-02-15T13:30:26Z)
- Steer2Edit: From Activation Steering to Component-Level Editing [24.755027943286432]
We propose Steer2Edit, a training-free framework that transforms steering vectors into diagnostic signals for component rank-1 weight editing. Across safety alignment, attribute mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing.
(2026-02-10T15:15:15Z)
- Internalizing LLM Reasoning via Discovery and Replay of Latent Actions [4.830503861275364]
Internalization of chain-of-thought processes into hidden states has emerged as a highly efficient paradigm for scaling test-time compute. We propose STIR (Self-Distilled Tools for Internal Reasoning), a framework that reformulates reasoning enhancement as a dynamic latent trajectory control problem.
(2026-02-04T08:44:57Z)
- RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering [62.63376387138257]
We propose a plug-and-play intervention framework that adaptively steers the reasoning of large language models (LLMs) in activation space. RISER constructs a library of reusable reasoning vectors and employs a lightweight Router to dynamically compose them for each input. The Router is optimized via reinforcement learning under task-level rewards, activating latent cognitive primitives in an emergent and compositional manner.
(2026-01-14T08:04:33Z)
- Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties. We derive a predictive model from coordination metrics that achieves a cross-validated R2=0…, enabling prediction on unseen task domains. We identify three effects, including (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead; and (2) capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
(2025-12-09T06:52:21Z)
- Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders [63.544453925182005]
We train 90 SAEs across three language models and evaluate their interpretability and steering utility. Our analysis reveals only a relatively weak positive association (Kendall's τ_b ≈ 0.298), indicating that interpretability is an insufficient proxy for steering performance. We propose a novel selection criterion called Delta Token Confidence, which measures how much amplifying a feature changes the next-token distribution.
(2025-10-04T04:14:50Z)
- CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features [1.5874067490843806]
We propose CorrSteer, which selects features by correlating sample correctness with SAE activations from generated tokens at inference time. Our work establishes correlation-based selection as an effective and scalable approach for automated SAE steering across language model applications.
(2025-08-18T00:01:42Z)
- GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers, guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
(2025-07-24T02:34:13Z)
- REAL: Reading Out Transformer Activations for Precise Localization in Language Model Steering [26.428347164111926]
Inference-time steering aims to alter a large language model's responses without changing its parameters. Existing approaches often rely on simplistic cues or ad hoc generalizations. We introduce REAL, a framework for identifying behavior-relevant modules in Transformer models.
(2025-06-10T02:16:50Z)
- SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.190800043449336]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance. We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
(2025-04-07T02:42:07Z)
- Effectively Steer LLM To Follow Preference via Building Confident Directions [39.40603123075168]
We propose a theoretical framework to understand and quantify the model steering methods. Inspired by the framework, we propose a confident direction steering method (CONFST) that steers LLMs via modifying their activations. Our approach offers three key advantages over popular bidirectional model steering methods.
(2025-03-04T20:32:27Z)
- Multi-Attribute Steering of Language Models via Targeted Intervention [56.93583799109029]
Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction. We introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes.
(2025-02-18T02:27:23Z)
- Joint Localization and Activation Editing for Low-Resource Fine-Tuning [73.64004083269424]
We propose a joint localization and activation editing (JoLA) method. JoLA learns (1) which heads in the Transformer to edit, (2) whether the intervention should be additive, multiplicative, or both, and (3) the intervention parameters themselves. We demonstrate that JoLA consistently outperforms existing methods.
(2025-02-03T09:13:09Z)
- LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models [16.37602070339033]
Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs. We propose LF-Steering, a novel activation steering approach to precisely identify latent feature representations responsible for semantic inconsistency. Our method maps the hidden states of the relevant transformer layer into a sparsely activated, high-dimensional feature space based on a sparse autoencoder.
(2025-01-19T13:06:51Z)
- Annotator: A Generic Active Learning Baseline for LiDAR Semantic Segmentation [40.803251337200656]
Annotator is a general and efficient active learning baseline.
Its voxel-centric online selection strategy is tailored to efficiently probe and annotate the salient and exemplar voxel grids within each LiDAR scan.
Annotator excels in diverse settings, with a particular focus on active learning (AL), active source-free domain adaptation (ASFDA), and active domain adaptation (ADA).
(2023-10-31T09:04:39Z)
- RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation [3.356734463419838]
Existing online multiple object tracking (MOT) algorithms often consist of two subtasks, detection and re-identification (ReID).
To enhance inference speed and reduce complexity, current methods commonly integrate these two subtasks into a unified framework.
We devise a module named Global Context Disentangling (GCD) that decouples the learned representation into detection-specific and ReID-specific embeddings.
To resolve this restriction, we develop a module, referred to as Guided Transformer Encoder (GTE), by combining the powerful reasoning ability of the Transformer encoder and deformable attention.
(2021-05-10T13:00:40Z)
- LiDAR-based Panoptic Segmentation via Dynamic Shifting Network [56.71765153629892]
LiDAR-based panoptic segmentation aims to parse both objects and scenes in a unified manner.
We propose the Dynamic Shifting Network (DS-Net), which serves as an effective panoptic segmentation framework in the point cloud realm.
Our proposed DS-Net achieves superior accuracies over current state-of-the-art methods.
(2020-11-24T08:44:46Z)
- AutoAssign: Differentiable Label Assignment for Dense Object Detection [94.24431503373884]
AutoAssign is an anchor-free detector for object detection.
It achieves appearance-aware label assignment through a fully differentiable weighting mechanism.
Our best model achieves 52.1% AP, outperforming all existing one-stage detectors.
(2020-07-07T14:32:21Z)
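The CorrSteer entry describes selecting SAE features by correlating sample correctness with feature activations on generated tokens. The selection step can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's code: Pearson correlation is computed between a 0/1 correctness label and one pooled activation per feature per sample, and the k most positively correlated features are kept. The pooling, the data layout, the ranking by signed rather than absolute correlation, and the names `pearson` and `select_features` are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    # Pearson correlation coefficient; returns 0.0 if either series is constant.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def select_features(feature_acts: Sequence[Sequence[float]], correct: Sequence[int], k: int) -> List[int]:
    # feature_acts[i][j]: pooled activation of SAE feature j on the generated tokens of sample i.
    # correct[i]: 1 if sample i was answered correctly, else 0.
    num_features = len(feature_acts[0])
    scores = [pearson([row[j] for row in feature_acts], correct) for j in range(num_features)]
    # Rank features by how strongly their activation tracks correctness (most positive first).
    return sorted(range(num_features), key=lambda j: scores[j], reverse=True)[:k]


if __name__ == "__main__":
    acts = [
        [0.9, 0.1, 0.5],  # correct sample
        [0.8, 0.7, 0.5],  # correct sample
        [0.1, 0.2, 0.5],  # incorrect sample
        [0.2, 0.9, 0.5],  # incorrect sample
    ]
    labels = [1, 1, 0, 0]
    print(select_features(acts, labels, k=2))  # [0, 2]
```

In the toy data, feature 0 rises with correctness, feature 2 is constant (correlation 0.0), and feature 1 is anti-correlated, so the two selected features are 0 and 2.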
This list is automatically generated from the titles and abstracts of the papers in this site.