Related papers: Fusion Steering: Prompt-Specific Activation Control

Fusion Steering: Prompt-Specific Activation Control

URL: http://arxiv.org/abs/2505.22572v1
Date: Wed, 28 May 2025 16:46:55 GMT
Title: Fusion Steering: Prompt-Specific Activation Control
Authors: Waldemar Chang, Alhassan Yasin,
Abstract summary: Fusion Steering improves factual accuracy in large language models (LLMs) for question-answering (QA) tasks.<n>This approach introduces flexible steering configurations, including full-layer steering and segmented steering.<n>Under the stricter SimpleQA rubric, segmented steering boosts fully correct responses from 0.0% to 13.1%.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present Fusion Steering, an activation steering methodology that improves factual accuracy in large language models (LLMs) for question-answering (QA) tasks. This approach introduces flexible steering configurations, including full-layer steering and segmented steering. Unlike traditional methods constrained to single-layer or fixed-layer operations, Fusion Steering employs dynamic injection of prompt-specific activation deltas across all transformer layers. These activation deltas are derived from reference completions that combine the ground-truth answer with a model-generated explanation to facilitate semantically enriched, example-specific steering. The injection weights are optimized per prompt using Optuna, targeting a joint objective that balances token overlap (factual alignment) and perplexity (fluency proxy). Evaluation employs a composite score integrating token overlap and LLM-graded quality, encompassing factual accuracy, coherence, and relevance. Empirical results on 260 SimpleQA prompts (selected from 500 where the baseline failed) showcase the efficacy of segmented steering. Using Gemma-2-2B-IT with 8-bit quantization, segmented steering achieves an accuracy of 25.4% (outputs scoring $\geq 0.6$), outperforming the baseline at 3.5% and full-layer steering at 16.2%. Under the stricter SimpleQA rubric, segmented steering boosts fully correct responses from 0.0% to 13.1%. These findings highlight the strengths of segmented, dynamic intervention strategies and the promise of per-prompt, full-network activation control. Fusion Steering is also amenable to sparse representations, such as Neuronpedia or sparse crosscoders, suggesting a promising direction for interpretable and scalable activation-level control in LLMs.

Related papers

Weight Updates as Activation Shifts: A Principled Framework for Steering [54.70188910511715]
Activation steering promises to be an extremely parameter-efficient form of adaptation, but its effectiveness depends on critical design choices.<n>We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior.<n>This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically-backed and highly expressive intervention site.
arXiv Detail & Related papers (2026-02-28T02:50:04Z)
ROAST: Rollout-based On-distribution Activation Steering Technique [16.632201561391366]
Activation steering provides parameter-efficient control over large language models at inference time.<n>We propose ROAST (Rollout-based On-distribution Activation Steering Technique), which estimates steering directions from the model's own on-distribution rollouts via ROC.<n>Our empirical analysis reveals that while activation magnitude correlates moderately with directional consistency, the variance in magnitude is significant and often disproportionate to semantic quality.
arXiv Detail & Related papers (2026-02-15T13:30:26Z)
Steer2Edit: From Activation Steering to Component-Level Editing [24.755027943286432]
We propose Steer2Edit, a training-free framework that transforms steering vectors into diagnostic signals for component rank-1 weight editing.<n>Across safety alignment, attribute mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs.<n>Overall, Steer2Edit provides a principled bridge between representation steering and weight editing.
arXiv Detail & Related papers (2026-02-10T15:15:15Z)
Internalizing LLM Reasoning via Discovery and Replay of Latent Actions [4.830503861275364]
Internalization of chain-of-thought processes into hidden states has emerged as a highly efficient paradigm for scaling test-time compute.<n>We propose STIR (Self-Distilled Tools for Internal Reasoning), a framework that reformulates reasoning enhancement as a dynamic latent trajectory control problem.
arXiv Detail & Related papers (2026-02-04T08:44:57Z)
RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering [62.63376387138257]
We propose a plug-and-play intervention framework that adaptively steers large language models (LLMs) reasoning in activation space.<n>RISER constructs a library of reusable reasoning vectors and employs a lightweight Router to dynamically compose them for each input.<n>The Router is optimized via reinforcement learning under task-level rewards, activating latent cognitive primitives in an emergent and compositional manner.
arXiv Detail & Related papers (2026-01-14T08:04:33Z)
Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, modelic, and task properties.<n>We derive a predictive model using coordination metrics, that cross-validated R2=0, enabling prediction on unseen task domains.<n>We identify three effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) a capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z)
Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders [63.544453925182005]
We train 90 SAEs across three language models and evaluate their interpretability and steering utility.<n>Our analysis reveals only a relatively weak positive association (tau b approx 0.298), indicating that interpretability is an insufficient proxy for steering performance.<n>We propose a novel selection criterion called Delta Token Confidence, which measures how much amplifying a feature changes the next token distribution.
arXiv Detail & Related papers (2025-10-04T04:14:50Z)
CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features [1.5874067490843806]
We propose CorrSteer, which selects features by correlating sample correctness with SAE activations from generated tokens at inference time.<n>Our work establishes correlation-based selection as an effective and scalable approach for automated SAE steering across language model applications.
arXiv Detail & Related papers (2025-08-18T00:01:42Z)
GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks.<n>During inference, GrAInS hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale.<n>It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
REAL: Reading Out Transformer Activations for Precise Localization in Language Model Steering [26.428347164111926]
Inference-time steering aims to alter a large language model's responses without changing its parameters.<n>Existing approaches often rely on simplistic cues or ad hoc generalizations.<n>We introduce REAL, a framework for identifying behavior-relevant modules in Transformer models.
arXiv Detail & Related papers (2025-06-10T02:16:50Z)
SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.190800043449336]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism.<n>Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance.<n>We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z)
Effectively Steer LLM To Follow Preference via Building Confident Directions [39.40603123075168]
We propose a theoretical framework to understand and quantify the model steering methods.<n>Inspired by the framework, we propose a confident direction steering method (CONFST) that steers LLMs via modifying their activations.<n>Our approach offers three key advantages over popular bidirectional model steering methods.
arXiv Detail & Related papers (2025-03-04T20:32:27Z)
Multi-Attribute Steering of Language Models via Targeted Intervention [56.93583799109029]
Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction.<n>We introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes.
arXiv Detail & Related papers (2025-02-18T02:27:23Z)
Joint Localization and Activation Editing for Low-Resource Fine-Tuning [73.64004083269424]
We propose a joint localization and activation editing (JoLA) method.<n>JoLA learns (1) which heads in the Transformer to edit (2) whether the intervention should be additive, multiplicative, or both and (3) the intervention parameters themselves.<n>We demonstrate that JoLA consistently outperforms existing methods.
arXiv Detail & Related papers (2025-02-03T09:13:09Z)
LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models [16.37602070339033]
Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs.<n>We propose LF-Steering, a novel activation steering approach to precisely identify latent feature representations responsible for semantic inconsistency.<n>Our method maps the hidden states of the relevant transformer layer into a sparsely activated, high-dimensional feature space based on a sparse autoencoder.
arXiv Detail & Related papers (2025-01-19T13:06:51Z)
Annotator: A Generic Active Learning Baseline for LiDAR Semantic Segmentation [40.803251337200656]
Annotator is a general and efficient active learning baseline. voxel-centric online selection strategy is tailored to efficiently probe and annotate the salient and exemplar voxel girds within each LiDAR scan. Annotator excels in diverse settings, with a particular focus on active learning (AL), active source-free domain adaptation (ASFDA), and active domain adaptation (ADA)
arXiv Detail & Related papers (2023-10-31T09:04:39Z)
RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation [3.356734463419838]
Existing online multiple object tracking (MOT) algorithms often consist of two subtasks, detection and re-identification (ReID) In order to enhance the inference speed and reduce the complexity, current methods commonly integrate these double subtasks into a unified framework. We devise a module named Global Context Disentangling (GCD) that decouples the learned representation into detection-specific and ReID-specific embeddings. To resolve this restriction, we develop a module, referred to as Guided Transformer (GTE), by combining the powerful reasoning ability of Transformer encoder and deformable attention.
arXiv Detail & Related papers (2021-05-10T13:00:40Z)
LiDAR-based Panoptic Segmentation via Dynamic Shifting Network [56.71765153629892]
LiDAR-based panoptic segmentation aims to parse both objects and scenes in a unified manner. We propose the Dynamic Shifting Network (DS-Net), which serves as an effective panoptic segmentation framework in the point cloud realm. Our proposed DS-Net achieves superior accuracies over current state-of-the-art methods.
arXiv Detail & Related papers (2020-11-24T08:44:46Z)
AutoAssign: Differentiable Label Assignment for Dense Object Detection [94.24431503373884]
Auto COCO is an anchor-free detector for object detection. It achieves appearance-aware through a fully differentiable weighting mechanism. Our best model achieves 52.1% AP, outperforming all existing one-stage detectors.
arXiv Detail & Related papers (2020-07-07T14:32:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.