FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies
- URL: http://arxiv.org/abs/2506.17673v1
- Date: Sat, 21 Jun 2025 10:18:25 GMT
- Title: FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies
- Authors: Seonglae Cho, Harryn Oh, Donghyun Lee, Luis Eduardo Rodrigues Vieira, Andrew Bermingham, Ziad El Sayed
- Abstract summary: We propose FaithfulSAE, a method that trains SAEs on the model's own synthetic dataset. We demonstrate that training SAEs on less-OOD instruction datasets results in SAEs being more stable across seeds.
- Score: 3.709351921096894
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse Autoencoders (SAEs) have emerged as a promising solution for decomposing large language model representations into interpretable features. However, Paulo and Belrose (2025) have highlighted instability across different initialization seeds, and Heap et al. (2025) have pointed out that SAEs may not capture model-internal features. These problems likely stem from training SAEs on external datasets - either collected from the Web or generated by another model - which may contain out-of-distribution (OOD) data beyond the model's generalisation capabilities. This can result in hallucinated SAE features, which we term "Fake Features", that misrepresent the model's internal activations. To address these issues, we propose FaithfulSAE, a method that trains SAEs on the model's own synthetic dataset. Using FaithfulSAEs, we demonstrate that training SAEs on less-OOD instruction datasets results in SAEs being more stable across seeds. Notably, FaithfulSAEs outperform SAEs trained on web-based datasets in the SAE probing task and exhibit a lower Fake Feature Ratio in 5 out of 7 models. Overall, our approach eliminates the dependency on external datasets, advancing interpretability by better capturing model-internal features while highlighting the often neglected importance of SAE training datasets.
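A minimal sketch of the idea described in the abstract, for illustration only: sample a synthetic dataset from the model itself, record residual-stream activations on it, and fit an L1-penalized SAE to those activations. The model (gpt2), layer index, and all hyperparameters below are assumptions, not the paper's exact recipe.

```python
# Sketch: train an SAE without an external dataset by letting the model
# generate its own training text. Model, layer, and hyperparameters are
# illustrative assumptions, not the FaithfulSAE paper's setup.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained(
    "gpt2", output_hidden_states=True
).to(device).eval()

@torch.no_grad()
def self_generated_activations(n_samples=32, layer=6, max_new_tokens=64):
    """Sample text from the model itself, then record residual-stream activations."""
    bos = torch.full((n_samples, 1), tok.bos_token_id, device=device)
    ids = lm.generate(bos, do_sample=True, top_p=0.95,
                      max_new_tokens=max_new_tokens, pad_token_id=tok.eos_token_id)
    hidden = lm(ids).hidden_states[layer]        # (batch, seq, d_model)
    return hidden.reshape(-1, hidden.shape[-1])  # one activation vector per token

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))          # sparse feature activations
        return self.dec(feats), feats

sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8).to(device)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                  # sparsity penalty strength (assumed)

for step in range(1_000):
    acts = self_generated_activations()
    recon, feats = sae(acts)
    loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```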
Related papers
- Ensembling Sparse Autoencoders [10.81463830315253]
Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. We propose to ensemble multiple SAEs through naive bagging and boosting. Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, diversity of features, and SAE stability.
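A minimal sketch of the bagging side of this idea, reusing the SparseAutoencoder interface from the sketch above; the paper's exact ensembling procedure is an assumption here.

```python
# Naive bagging for SAEs (assumed reading of the summary): train several SAEs
# independently, then ensemble by averaging their reconstructions.
import torch

def ensemble_reconstruct(saes, acts):
    """Average the reconstructions of independently trained SAEs."""
    recons = [sae(acts)[0] for sae in saes]  # each forward returns (recon, feats)
    return torch.stack(recons).mean(dim=0)

# Usage with a hypothetical trainer: saes = [train_sae(seed=s) for s in range(5)]
```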
arXiv Detail & Related papers (2025-05-21T23:31:21Z)
- Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models [50.587868616659826]
We introduce a comprehensive framework for evaluating monosemanticity at the neuron level in vision representations. Our experimental results reveal that SAEs trained on Vision-Language Models significantly enhance the monosemanticity of individual neurons.
arXiv Detail & Related papers (2025-04-03T17:58:35Z)
- Do Sparse Autoencoders Generalize? A Case Study of Answerability [12.131254862319865]
We evaluate SAE feature generalization across diverse answerability datasets for Gemma 2 SAEs. Our analysis reveals that residual stream probes outperform SAE features within domains, but generalization performance differs sharply.
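The probing comparison can be sketched as fitting the same linear probe on two representations of the same inputs; the logistic-regression probe below is an illustrative assumption, not the paper's protocol.

```python
# Compare a probe on raw residual-stream activations against one on SAE
# feature activations. The probe family (logistic regression) is assumed.
from sklearn.linear_model import LogisticRegression

def probe_accuracy(X_train, y_train, X_test, y_test):
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_test, y_test)

# resid_acc = probe_accuracy(resid_train, y_train, resid_test, y_test)
# sae_acc = probe_accuracy(sae_feats_train, y_train, sae_feats_test, y_test)
```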
arXiv Detail & Related papers (2025-02-27T10:45:25Z)
- Sparse Autoencoders Trained on the Same Data Learn Different Features [0.7234862895932991]
Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in large language models. Our research shows that SAEs trained on the same model and data, differing only in the random seed used to initialize their weights, identify different sets of features.
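One way to quantify this instability (an illustrative metric, not necessarily the one used in the paper): match each decoder feature of one seed's SAE to its most similar counterpart in another seed's SAE and report the fraction with a close match.

```python
# Fraction of features in SAE A with a near-duplicate in SAE B, treating
# decoder weight columns as feature directions in activation space. The 0.7
# threshold is an arbitrary illustration.
import torch
import torch.nn.functional as F

@torch.no_grad()
def shared_feature_fraction(sae_a, sae_b, threshold=0.7):
    dirs_a = F.normalize(sae_a.dec.weight, dim=0)  # (d_model, d_hidden), unit columns
    dirs_b = F.normalize(sae_b.dec.weight, dim=0)
    sims = dirs_a.T @ dirs_b                       # pairwise cosine similarities
    best = sims.max(dim=1).values                  # best match for each A-feature
    return (best > threshold).float().mean().item()
```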
arXiv Detail & Related papers (2025-01-28T01:24:16Z)
- Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders [8.003244901104111]
We propose a regularization technique for improving feature learning by encouraging SAEs trained in parallel to learn similar features.
MFR can improve the reconstruction loss of SAEs by up to 21.21% on GPT-2 Small, and 6.67% on EEG data.
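The summary does not spell out the MFR loss itself; as an illustrative stand-in, the cosine-matching machinery from shared_feature_fraction above can be turned into a differentiable penalty between two parallel SAEs' decoder directions.

```python
# Hypothetical alignment penalty (not the paper's exact MFR term): push each
# feature direction in one SAE toward its best-matching direction in the other.
import torch
import torch.nn.functional as F

def feature_alignment_penalty(sae_a, sae_b):
    dirs_a = F.normalize(sae_a.dec.weight, dim=0)
    dirs_b = F.normalize(sae_b.dec.weight, dim=0)
    sims = dirs_a.T @ dirs_b
    return 1.0 - sims.max(dim=1).values.mean()  # zero when every feature has a twin

# total_loss = recon_loss_a + recon_loss_b + alpha * feature_alignment_penalty(sae_a, sae_b)
```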
arXiv Detail & Related papers (2024-11-02T11:42:23Z)
- Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z)
- Distribution Guided Active Feature Acquisition [14.279123976398926]
We develop an active feature acquisition framework that interacts with the environment to obtain new information on-the-fly.
We build our AFA framework on a backbone of understanding the information and conditional dependencies that are present in the data.
We show that it is possible to guide the training of RL agents for AFA via side-information and auxiliary rewards stemming from our generative models.
arXiv Detail & Related papers (2024-10-04T20:38:30Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN)
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models.
This synthetic data is employed to evaluate the robustness of pretrained segmenters.
We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z)
- Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other.
We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
arXiv Detail & Related papers (2023-10-26T17:59:46Z)
- FedFed: Feature Distillation against Data Heterogeneity in Federated Learning [88.36513907827552]
Federated learning (FL) typically faces data heterogeneity, i.e., distribution shifting among clients.
We propose a novel approach called Federated Feature distillation (FedFed).
FedFed partitions data into performance-sensitive features (i.e., greatly contributing to model performance) and performance-robust features (i.e., limitedly contributing to model performance).
Comprehensive experiments demonstrate the efficacy of FedFed in promoting model performance.
arXiv Detail & Related papers (2023-10-08T09:00:59Z)
- Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs [47.14410674505256]
We present a case study of syntax acquisition in masked language models (MLMs). We study Syntactic Attention Structure (SAS), a naturally emerging property of MLMs wherein specific Transformer heads tend to focus on specific syntactic relations. We examine the causal role of SAS by manipulating SAS during training, and demonstrate that SAS is necessary for the development of grammatical capabilities.
arXiv Detail & Related papers (2023-09-13T20:57:11Z)