Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement
- URL: http://arxiv.org/abs/2509.23799v2
- Date: Fri, 03 Oct 2025 11:34:59 GMT
- Title: Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement
- Authors: Anyi Wang, Xuansheng Wu, Dong Shu, Yunpu Ma, Ninghao Liu,
- Abstract summary: Existing steering methods rely on large-scale datasets to learn clear behavioral information. We introduce Refinement of Steering Vector via Sparse Autoencoder (SAE-RSV) that leverages SAEs to semantically denoise and augment the steering vectors. In our framework, we first remove task-irrelevant features according to their semantics provided by SAEs, and then enrich task-relevant features missing from the small dataset through their semantic similarity to the identified relevant features.
- Score: 31.282134977964976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Steering has emerged as a promising approach to controlling large language models (LLMs) without modifying model parameters. However, most existing steering methods rely on large-scale datasets to learn clear behavioral information, which limits their applicability in many real-world scenarios. Steering vectors extracted from small datasets often contain task-irrelevant noisy features, which degrades their effectiveness. To refine steering vectors learned from limited data, we introduce Refinement of Steering Vector via Sparse Autoencoder (SAE-RSV), which leverages SAEs to semantically denoise and augment the steering vectors. In our framework, we first remove task-irrelevant features according to their semantics provided by SAEs, and then enrich task-relevant features missing from the small dataset through their semantic similarity to the identified relevant features. Extensive experiments demonstrate that the proposed SAE-RSV substantially outperforms all baseline methods, including supervised fine-tuning. Our findings show that effective steering vectors can be constructed from limited training data by refining the original steering vector through SAEs.
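The abstract describes a two-step refinement: denoise a raw steering vector by dropping task-irrelevant SAE features, then augment it with features that are semantically similar to the identified relevant ones. The snippet below is a minimal, hedged sketch of that idea; the function name, the use of normalized decoder directions as feature semantics, and the similarity threshold are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of SAE-based steering-vector refinement in the spirit of SAE-RSV.
# All tensor names, thresholds, and the projection scheme are assumptions.
import torch
import torch.nn.functional as F

def refine_steering_vector(
    v: torch.Tensor,          # raw steering vector from a small dataset, shape (d,)
    W_dec: torch.Tensor,      # SAE decoder directions, shape (n_features, d)
    relevant: torch.Tensor,   # boolean mask of task-relevant SAE features, shape (n_features,)
    sim_threshold: float = 0.7,
    alpha: float = 1.0,
) -> torch.Tensor:
    """Denoise v by keeping only task-relevant SAE features, then augment it
    with features that are semantically similar to the relevant ones."""
    if not relevant.any():
        return v  # nothing known to keep; fall back to the raw vector

    # 1) Project the raw vector onto SAE feature directions (a crude encoder stand-in).
    dirs = F.normalize(W_dec, dim=-1)                # (n_features, d)
    acts = dirs @ v                                  # per-feature activation, (n_features,)

    # 2) Denoise: zero out activations of task-irrelevant features.
    acts = torch.where(relevant, acts, torch.zeros_like(acts))
    v_denoised = acts @ dirs                         # (d,)

    # 3) Augment: add features whose direction is similar to any relevant feature.
    sims = dirs @ dirs[relevant].T                   # (n_features, n_relevant)
    similar = (sims.max(dim=-1).values > sim_threshold) & ~relevant
    v_aug = dirs[similar].sum(dim=0) if similar.any() else torch.zeros_like(v)

    return v_denoised + alpha * v_aug
```

In practice the refined vector would then be added to the model's hidden state at a chosen layer during generation, as in standard activation-steering setups.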
Related papers
- Step-Level Sparse Autoencoder for Reasoning Process Interpretation [48.99201531966593]
Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. We propose a step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features.
arXiv Detail & Related papers (2026-03-03T14:25:02Z) - One-shot Optimized Steering Vector for Hallucination Mitigation for VLMs [8.089908150148554]
Vision Language Models (VLMs) achieve strong performance on multimodal tasks but still suffer from hallucination and safety-related failures. We propose OSGA (One-shot Steering with Generative Anchor), an input-independent framework that improves model performance with a single optimization instance.
arXiv Detail & Related papers (2026-01-30T14:47:59Z) - SteerX: Disentangled Steering for LLM Personalization [75.89038195784701]
Large language models (LLMs) have shown remarkable success in recent years, enabling a wide range of applications. A critical factor in building such assistants is personalizing LLMs, as user preferences and needs vary widely. We propose SteerX, a method that isolates preference-driven components from preference-agnostic components.
arXiv Detail & Related papers (2025-10-25T11:26:20Z) - AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint [49.641959856967276]
We present a theoretically grounded and empirically effective activation steering method called AlphaSteer. For utility preservation, it learns to construct a nearly zero vector for steering benign data under null-space constraints. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer.
arXiv Detail & Related papers (2025-06-08T07:03:28Z) - SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models [41.553639748766784]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces.
arXiv Detail & Related papers (2025-05-22T03:46:57Z) - Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models [48.40096116617163]
Large Language Models (LLMs) demonstrate the ability to solve reasoning and mathematical problems using the Chain-of-Thought (CoT) technique. This work, inspired by the deep thinking paradigm of DeepSeek-R1, utilizes a steering technique to enhance the reasoning ability of an LLM without external datasets.
arXiv Detail & Related papers (2025-05-21T15:17:59Z) - Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering [41.588589098740755]
Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keep the most discriminative SAE latents while reconstructing hidden representations (a minimal sketch of this latent-selection idea appears after this list).
arXiv Detail & Related papers (2025-05-21T02:45:11Z) - ExpertSteer: Intervening in LLMs through Expert Knowledge [86.98098988779809]
Activation steering offers a promising method to control the generation process of Large Language Models. We propose ExpertSteer, a novel approach that leverages arbitrary specialized expert models to generate steering vectors. We conduct comprehensive experiments using three LLMs on 15 popular benchmarks across four distinct domains.
arXiv Detail & Related papers (2025-05-18T08:55:46Z) - Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations [4.029252551781513]
We propose a principled approach for uncovering steering vectors. We focus on extracting latent risk preferences from large language models. We show that the resulting steering vectors successfully and reliably modulate LLM outputs in line with the targeted behavior.
arXiv Detail & Related papers (2025-05-16T18:23:10Z) - Interpretable Steering of Large Language Models with Feature Guided Activation Additions [4.496738719682736]
We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method. By operating in the latent space of a Sparse Autoencoder (SAE), FGAA constructs precise steering vectors. Evaluations on Gemma-2-2B and Gemma-2-9B models demonstrate that FGAA outperforms existing steering methods.
arXiv Detail & Related papers (2025-01-17T02:55:23Z) - Improving Steering Vectors by Targeting Sparse Autoencoder Features [2.4188584949331053]
We develop an improved steering method, SAE-Targeted Steering (SAE-TS), which finds steering vectors to target specific SAE features while minimizing unintended side effects.
We show that SAE-TS balances steering effects with coherence better than CAA and SAE feature steering, when evaluated on a range of tasks.
arXiv Detail & Related papers (2024-11-04T15:46:20Z) - Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features [69.47588461101925]
We propose a method to adapt 3D object detectors to new driving environments.
Our approach enhances LiDAR-based detection models using spatial quantized historical features.
Experiments on real-world datasets demonstrate significant improvements.
arXiv Detail & Related papers (2023-09-21T15:00:31Z)
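One related entry above (SDCV) names a concrete mechanism: keep only the most discriminative SAE latents when reconstructing hidden representations before forming a concept vector. The snippet below is a rough, hedged illustration of that idea under assumed SAE encoder/decoder callables; the names, the ranking criterion (mean-activation gap between classes), and `top_k` are assumptions, not the paper's code.

```python
# Rough sketch of SDCV-style latent selection; purely illustrative.
import torch

def sdcv_concept_vector(
    h_pos: torch.Tensor,      # hidden states of positive examples, (n_pos, d)
    h_neg: torch.Tensor,      # hidden states of negative examples, (n_neg, d)
    encode,                   # assumed SAE encoder: (n, d) -> (n, n_features)
    decode,                   # assumed SAE decoder: (n, n_features) -> (n, d)
    top_k: int = 32,
) -> torch.Tensor:
    """Keep only the most discriminative SAE latents, reconstruct the hidden
    states from them, and take the difference of class means as the vector."""
    z_pos, z_neg = encode(h_pos), encode(h_neg)
    # Rank latents by the gap between their mean activations on the two classes.
    gap = (z_pos.mean(0) - z_neg.mean(0)).abs()
    keep = torch.zeros_like(gap, dtype=torch.bool)
    keep[gap.topk(top_k).indices] = True
    # Reconstruct hidden states using only the selected latents.
    h_pos_rec = decode(z_pos * keep)
    h_neg_rec = decode(z_neg * keep)
    return h_pos_rec.mean(0) - h_neg_rec.mean(0)
```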