Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering
- URL: http://arxiv.org/abs/2505.15038v2
- Date: Tue, 29 Jul 2025 21:40:42 GMT
- Title: Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering
- Authors: Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du,
- Abstract summary: Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness.<n>We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keep the most discriminative SAE latents while reconstructing hidden representations.
- Score: 41.588589098740755
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keep the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up activations of top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-mean, SDCV consistently improves steering success rates by 4-16\% across six challenging concepts, while maintaining topic relevance.
Related papers
- Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts [79.18608192761512]
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to enable their visual recognition processes more interpretable.<n>We propose a Few-Shot Prototypical Concept Classification framework that mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment.<n>Our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification.
arXiv Detail & Related papers (2025-06-05T06:39:43Z) - Identifiable Steering via Sparse Autoencoding of Multi-Concept Shifts [11.81523319216474]
Steering methods manipulate the representations of large language models (LLMs) to induce responses that have desired properties.<n>Traditionally, steering has relied on supervision, such as from contrastive pairs of prompts that vary in a single target concept.<n>We introduce Sparse Shift Autoencoders (SSAEs) that instead map the differences between embeddings to sparse representations.
arXiv Detail & Related papers (2025-02-14T08:49:41Z) - Pivotal Auto-Encoder via Self-Normalizing ReLU [20.76999663290342]
We formalize single hidden layer sparse auto-encoders as a transform learning problem.
We propose an optimization problem that leads to a predictive model invariant to the noise level at test time.
Our experimental results demonstrate that the trained models yield a significant improvement in stability against varying types of noise.
arXiv Detail & Related papers (2024-06-23T09:06:52Z) - Vector Quantized Wasserstein Auto-Encoder [57.29764749855623]
We study learning deep discrete representations from the generative viewpoint.
We endow discrete distributions over sequences of codewords and learn a deterministic decoder that transports the distribution over the sequences of codewords to the data distribution.
We develop further theories to connect it with the clustering viewpoint of WS distance, allowing us to have a better and more controllable clustering solution.
arXiv Detail & Related papers (2023-02-12T13:51:36Z) - Covariate-informed Representation Learning with Samplewise Optimal
Identifiable Variational Autoencoders [15.254297587065595]
Recently proposed identifiable variational autoencoder (iVAE) provides a promising approach for learning latent independent components of the data.
We develop a new approach, co-informed identifiable VAE (CI-iVAE)
In doing so, the objective function enforces the inverse relation, and learned representation contains more information of observations.
arXiv Detail & Related papers (2022-02-09T00:18:33Z) - Navigating Neural Space: Revisiting Concept Activation Vectors to Overcome Directional Divergence [13.618809162030486]
Concept Activation Vectors (CAVs) have emerged as a popular tool for modeling human-understandable concepts in the latent space.<n>In this paper we show that such a separability-oriented leads to solutions, which may diverge from the actual goal of precisely modeling the concept direction.<n>We introduce pattern-based CAVs, solely focussing on concept signals, thereby providing more accurate concept directions.
arXiv Detail & Related papers (2022-02-07T19:40:20Z) - Adaptive Discrete Communication Bottlenecks with Dynamic Vector
Quantization [76.68866368409216]
We propose learning to dynamically select discretization tightness conditioned on inputs.
We show that dynamically varying tightness in communication bottlenecks can improve model performance on visual reasoning and reinforcement learning tasks.
arXiv Detail & Related papers (2022-02-02T23:54:26Z) - InteL-VAEs: Adding Inductive Biases to Variational Auto-Encoders via
Intermediary Latents [60.785317191131284]
We introduce a simple and effective method for learning VAEs with controllable biases by using an intermediary set of latent variables.
In particular, it allows us to impose desired properties like sparsity or clustering on learned representations.
We show that this, in turn, allows InteL-VAEs to learn both better generative models and representations.
arXiv Detail & Related papers (2021-06-25T16:34:05Z) - Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
In contrast, the introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
arXiv Detail & Related papers (2021-05-29T09:26:02Z) - Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual
Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z) - Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour on the learned representations and also the consequences of fixing it by introducing a notion of self consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z) - Learning Noise-Aware Encoder-Decoder from Noisy Labels by Alternating
Back-Propagation for Saliency Detection [54.98042023365694]
We propose a noise-aware encoder-decoder framework to disentangle a clean saliency predictor from noisy training examples.
The proposed model consists of two sub-models parameterized by neural networks.
arXiv Detail & Related papers (2020-07-23T18:47:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.