Related papers: Analysis of Variational Sparse Autoencoders

Analysis of Variational Sparse Autoencoders

URL: http://arxiv.org/abs/2509.22994v2
Date: Wed, 01 Oct 2025 20:44:58 GMT
Title: Analysis of Variational Sparse Autoencoders
Authors: Zachary Baker, Yuxiao Li,
Abstract summary: We investigate whether incorporating variational methods into SAE architectures can improve feature organization and interpretability.<n>We introduce the Variational Sparse Autoencoder (vSAE), which replaces deterministic ReLU gating with sampling from learned Gaussian posteriors.<n>Our findings suggest that naive application of variational methods to SAEs does not improve feature organization or interpretability.
Score: 1.675385127117872
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparse Autoencoders (SAEs) have emerged as a promising approach for interpreting neural network representations by learning sparse, human-interpretable features from dense activations. We investigate whether incorporating variational methods into SAE architectures can improve feature organization and interpretability. We introduce the Variational Sparse Autoencoder (vSAE), which replaces deterministic ReLU gating with stochastic sampling from learned Gaussian posteriors and incorporates KL divergence regularization toward a standard normal prior. Our hypothesis is that this probabilistic sampling creates dispersive pressure, causing features to organize more coherently in the latent space while avoiding overlap. We evaluate a TopK vSAE against a standard TopK SAE on Pythia-70M transformer residual stream activations using comprehensive benchmarks including SAE Bench, individual feature interpretability analysis, and global latent space visualization through t-SNE. The vSAE underperforms standard SAE across core evaluation metrics, though excels at feature independence and ablation metrics. The KL divergence term creates excessive regularization pressure that substantially reduces the fraction of living features, leading to observed performance degradation. While vSAE features demonstrate improved robustness, they exhibit many more dead features than baseline. Our findings suggest that naive application of variational methods to SAEs does not improve feature organization or interpretability.

Related papers

Feature-Space Adversarial Robustness Certification for Multimodal Large Language Models [59.6491828112519]
Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications.<n> MLLMs are vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions.<n>We propose Feature-space Smoothing (FS), a general framework that provides certified robustness guarantees at the feature representation level of MLLMs.
arXiv Detail & Related papers (2026-01-22T18:52:21Z)
Steering Vision-Language Pre-trained Models for Incremental Face Presentation Attack Detection [62.89126207012712]
Face Presentation Attack Detection (PAD) demands incremental learning to combat spoofing tactics and domains.<n>Privacy regulations forbid retaining past data, necessitating rehearsal-free learning (RF-IL)
arXiv Detail & Related papers (2025-12-22T04:30:11Z)
ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders [30.219733023958188]
Sparse Autoencoder (SAE) has emerged as a powerful tool for mechanistic interpretability of large language models.<n>We propose a semantically-guided SAE, called ProtSAE.<n>We show that ProtSAE learns more biologically relevant and interpretable hidden features compared to previous methods.
arXiv Detail & Related papers (2025-08-26T11:20:31Z)
Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders [50.52694757593443]
Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations.<n>We first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability.<n>We introduce a new SAE training algorithm based on bias adaptation'', a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity.
arXiv Detail & Related papers (2025-06-16T20:58:05Z)
Probabilistic Variational Contrastive Learning [2.512491726995032]
We propose a decoder-free framework that maximizes the evidence lower bound (ELBO)<n>We model the approximate posterior $q_theta(z|x)$ as a projected normal distribution, enabling the sampling of probabilistic embeddings.
arXiv Detail & Related papers (2025-06-11T20:26:07Z)
Ensembling Sparse Autoencoders [10.81463830315253]
Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features.<n>We propose to ensemble multiple SAEs through naive bagging and boosting.<n>Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, diversity of features, and SAE stability.
arXiv Detail & Related papers (2025-05-21T23:31:21Z)
Do Sparse Autoencoders Generalize? A Case Study of Answerability [32.356991861926105]
We evaluate SAE feature generalization across diverse, partly self-constructed answerability datasets for Gemma 2 SAEs.<n>Our analysis reveals that residual stream probes outperform SAE features within domains, but generalization performance differs sharply.
arXiv Detail & Related papers (2025-02-27T10:45:25Z)
Source-Free Domain Adaptive Object Detection with Semantics Compensation [54.00183496587841]
We introduce Weak-to-strong Semantics Compensation (WSCo) for strong data augmentation.<n>WSCo compensates for the class-relevant semantics that may be lost during strong augmentation on the fly.<n>WSCo can be implemented as a generic plug-in, easily integrable with any existing SFOD pipelines.
arXiv Detail & Related papers (2024-10-07T23:32:06Z)
PseudoNeg-MAE: Self-Supervised Point Cloud Learning using Conditional Pseudo-Negative Embeddings [55.55445978692678]
PseudoNeg-MAE enhances global feature representation of point cloud masked autoencoders by making them both discriminative and sensitive to transformations.<n>We propose a novel loss that explicitly penalizes invariant collapse, enabling the network to capture richer transformation cues while preserving discriminative representations.
arXiv Detail & Related papers (2024-09-24T07:57:21Z)
Diagnosing and Rectifying Fake OOD Invariance: A Restructured Causal Approach [51.012396632595554]
Invariant representation learning (IRL) encourages the prediction from invariant causal features to labels de-confounded from the environments. Recent theoretical results verified that some causal features recovered by IRLs merely pretend domain-invariantly in the training environments but fail in unseen domains. We develop an approach based on conditional mutual information with respect to RS-SCM, then rigorously rectify the spurious and fake invariant effects.
arXiv Detail & Related papers (2023-12-15T12:58:05Z)
An Information-Theoretic Perspective on Variance-Invariance-Covariance Regularization [52.44068740462729]
We present an information-theoretic perspective on the VICReg objective. We derive a generalization bound for VICReg, revealing its inherent advantages for downstream tasks. We introduce a family of SSL methods derived from information-theoretic principles that outperform existing SSL techniques.
arXiv Detail & Related papers (2023-03-01T16:36:25Z)
Regularizing Variational Autoencoder with Diversity and Uncertainty Awareness [61.827054365139645]
Variational Autoencoder (VAE) approximates the posterior of latent variables based on amortized variational inference. We propose an alternative model, DU-VAE, for learning a more Diverse and less Uncertain latent space.
arXiv Detail & Related papers (2021-10-24T07:58:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.