Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement
- URL: http://arxiv.org/abs/2509.23799v2
- Date: Fri, 03 Oct 2025 11:34:59 GMT
- Title: Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement
- Authors: Anyi Wang, Xuansheng Wu, Dong Shu, Yunpu Ma, Ninghao Liu,
- Abstract summary: Existing steering methods rely on large-scale datasets to learn clear behavioral information. We introduce Refinement of Steering Vector via Sparse Autoencoder (SAE-RSV) that leverages SAEs to semantically denoise and augment the steering vectors. In our framework, we first remove task-irrelevant features according to their semantics provided by SAEs, and then enrich task-relevant features missing from the small dataset through their semantic similarity to the identified relevant features.
- Score: 31.282134977964976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Steering has emerged as a promising approach to controlling large language models (LLMs) without modifying model parameters. However, most existing steering methods rely on large-scale datasets to learn clear behavioral information, which limits their applicability in many real-world scenarios. Steering vectors extracted from small datasets often contain task-irrelevant noisy features, which degrades their effectiveness. To refine steering vectors learned from limited data, we introduce Refinement of Steering Vector via Sparse Autoencoder (SAE-RSV), which leverages SAEs to semantically denoise and augment the steering vectors. In our framework, we first remove task-irrelevant features according to their semantics provided by SAEs, and then enrich task-relevant features missing from the small dataset through their semantic similarity to the identified relevant features. Extensive experiments demonstrate that the proposed SAE-RSV substantially outperforms all baseline methods, including supervised fine-tuning. Our findings show that effective steering vectors can be constructed from limited training data by refining the original steering vector through SAEs.
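The abstract describes a two-step refinement: denoise a raw steering vector by dropping task-irrelevant SAE features, then augment it with features that are semantically similar to the identified relevant ones. The snippet below is a minimal, hedged sketch of that idea; the function name, the use of normalized decoder directions as feature semantics, and the similarity threshold are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of SAE-based steering-vector refinement in the spirit of SAE-RSV.
# All tensor names, thresholds, and the projection scheme are assumptions.
import torch
import torch.nn.functional as F

def refine_steering_vector(
    v: torch.Tensor,          # raw steering vector from a small dataset, shape (d,)
    W_dec: torch.Tensor,      # SAE decoder directions, shape (n_features, d)
    relevant: torch.Tensor,   # boolean mask of task-relevant SAE features, shape (n_features,)
    sim_threshold: float = 0.7,
    alpha: float = 1.0,
) -> torch.Tensor:
    """Denoise v by keeping only task-relevant SAE features, then augment it
    with features that are semantically similar to the relevant ones."""
    if not relevant.any():
        return v  # nothing known to keep; fall back to the raw vector

    # 1) Project the raw vector onto SAE feature directions (a crude encoder stand-in).
    dirs = F.normalize(W_dec, dim=-1)                # (n_features, d)
    acts = dirs @ v                                  # per-feature activation, (n_features,)

    # 2) Denoise: zero out activations of task-irrelevant features.
    acts = torch.where(relevant, acts, torch.zeros_like(acts))
    v_denoised = acts @ dirs                         # (d,)

    # 3) Augment: add features whose direction is similar to any relevant feature.
    sims = dirs @ dirs[relevant].T                   # (n_features, n_relevant)
    similar = (sims.max(dim=-1).values > sim_threshold) & ~relevant
    v_aug = dirs[similar].sum(dim=0) if similar.any() else torch.zeros_like(v)

    return v_denoised + alpha * v_aug
```

In practice the refined vector would then be added to the model's hidden state at a chosen layer during generation, as in standard activation-steering setups.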
Related papers
- Step-Level Sparse Autoencoder for Reasoning Process Interpretation [48.99201531966593]
Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. We propose a step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features.
arXiv Detail & Related papers (2026-03-03T14:25:02Z) - One-shot Optimized Steering Vector for Hallucination Mitigation for VLMs [8.089908150148554]
Vision Language Models (VLMs) achieve strong performance on multimodal tasks but still suffer from hallucination and safety-related failures. We propose OSGA (One-shot Steering with Generative Anchor), an input-independent framework that improves model performance with a single optimization instance.
arXiv Detail & Related papers (2026-01-30T14:47:59Z) - SteerX: Disentangled Steering for LLM Personalization [75.89038195784701]
Large language models (LLMs) have shown remarkable success in recent years, enabling a wide range of applications. A critical factor in building such assistants is personalizing LLMs, as user preferences and needs vary widely. We propose SteerX, a method that isolates preference-driven components from preference-agnostic components.
arXiv Detail & Related papers (2025-10-25T11:26:20Z) - AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint [49.641959856967276]
We present a theoretically grounded and empirically effective activation steering method called AlphaSteer. For utility preservation, it learns to construct a nearly zero vector for steering benign data under null-space constraints. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer.
arXiv Detail & Related papers (2025-06-08T07:03:28Z) - SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models [41.553639748766784]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces.
arXiv Detail & Related papers (2025-05-22T03:46:57Z) - Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models [48.40096116617163]
Large Language Models (LLMs) demonstrate the ability to solve reasoning and mathematical problems using the Chain-of-Thought (CoT) technique. This work, inspired by the deep thinking paradigm of DeepSeek-R1, utilizes a steering technique to enhance the reasoning ability of an LLM without external datasets.
arXiv Detail & Related papers (2025-05-21T15:17:59Z) - Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering [41.588589098740755]
Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keep the most discriminative SAE latents while reconstructing hidden representations (a minimal sketch of this latent-selection idea appears after this list).
arXiv Detail & Related papers (2025-05-21T02:45:11Z) - ExpertSteer: Intervening in LLMs through Expert Knowledge [86.98098988779809]
Activation steering offers a promising method to control the generation process of Large Language Models. We propose ExpertSteer, a novel approach that leverages arbitrary specialized expert models to generate steering vectors. We conduct comprehensive experiments using three LLMs on 15 popular benchmarks across four distinct domains.
arXiv Detail & Related papers (2025-05-18T08:55:46Z) - Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations [4.029252551781513]
We propose a principled approach for uncovering steering vectors. We focus on extracting latent risk preferences from large language models. We show that the resulting steering vectors successfully and reliably modulate LLM outputs in line with the targeted behavior.
arXiv Detail & Related papers (2025-05-16T18:23:10Z) - Interpretable Steering of Large Language Models with Feature Guided Activation Additions [4.496738719682736]
We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method. By operating in the latent space of a Sparse Autoencoder (SAE), FGAA constructs precise steering vectors. Evaluations on Gemma-2-2B and Gemma-2-9B models demonstrate that FGAA outperforms existing steering methods.
arXiv Detail & Related papers (2025-01-17T02:55:23Z) - Improving Steering Vectors by Targeting Sparse Autoencoder Features [2.4188584949331053]
We develop an improved steering method, SAE-Targeted Steering (SAE-TS), which finds steering vectors to target specific SAE features while minimizing unintended side effects.
We show that SAE-TS balances steering effects with coherence better than CAA and SAE feature steering, when evaluated on a range of tasks.
arXiv Detail & Related papers (2024-11-04T15:46:20Z) - Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features [69.47588461101925]
We propose a method to adapt 3D object detectors to new driving environments.
Our approach enhances LiDAR-based detection models using spatial quantized historical features.
Experiments on real-world datasets demonstrate significant improvements.
arXiv Detail & Related papers (2023-09-21T15:00:31Z)
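One related entry above (SDCV) names a concrete mechanism: keep only the most discriminative SAE latents when reconstructing hidden representations before forming a concept vector. The snippet below is a rough, hedged illustration of that idea under assumed SAE encoder/decoder callables; the names, the ranking criterion (mean-activation gap between classes), and `top_k` are assumptions, not the paper's code.

```python
# Rough sketch of SDCV-style latent selection; purely illustrative.
import torch

def sdcv_concept_vector(
    h_pos: torch.Tensor,      # hidden states of positive examples, (n_pos, d)
    h_neg: torch.Tensor,      # hidden states of negative examples, (n_neg, d)
    encode,                   # assumed SAE encoder: (n, d) -> (n, n_features)
    decode,                   # assumed SAE decoder: (n, n_features) -> (n, d)
    top_k: int = 32,
) -> torch.Tensor:
    """Keep only the most discriminative SAE latents, reconstruct the hidden
    states from them, and take the difference of class means as the vector."""
    z_pos, z_neg = encode(h_pos), encode(h_neg)
    # Rank latents by the gap between their mean activations on the two classes.
    gap = (z_pos.mean(0) - z_neg.mean(0)).abs()
    keep = torch.zeros_like(gap, dtype=torch.bool)
    keep[gap.topk(top_k).indices] = True
    # Reconstruct hidden states using only the selected latents.
    h_pos_rec = decode(z_pos * keep)
    h_neg_rec = decode(z_neg * keep)
    return h_pos_rec.mean(0) - h_neg_rec.mean(0)
```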