Related papers: Sparse Autoencoders Trained on the Same Data Learn Different Features


SWE-RM: Execution-free Feedback For Software Engineering Agents [61.86380395896069]
Execution-based feedback is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. We introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference.
arXiv Detail & Related papers (2025-12-26T08:26:18Z)
Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit [16.056849135589324]
Analyzing large-scale text corpora is a core challenge in machine learning. We propose using sparse autoencoders (SAEs) to create SAE embeddings. We show that SAE embeddings are more cost-effective and reliable than LLMs and more controllable than dense embeddings.
arXiv Detail & Related papers (2025-12-10T21:26:24Z)
Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders [63.544453925182005]
We train 90 SAEs across three language models and evaluate their interpretability and steering utility. Our analysis reveals only a relatively weak positive association (tau-b ≈ 0.298), indicating that interpretability is an insufficient proxy for steering performance. We propose a novel selection criterion called Delta Token Confidence, which measures how much amplifying a feature changes the next-token distribution.
arXiv Detail & Related papers (2025-10-04T04:14:50Z)
FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies [3.709351921096894]
We propose FaithfulSAE, a method that trains SAEs on the model's own synthetic dataset. We demonstrate that training SAEs on less-OOD instruction datasets results in SAEs being more stable across seeds.
arXiv Detail & Related papers (2025-06-21T10:18:25Z)
Dense SAE Latents Are Features, Not Bugs [75.08462524662072]
We show that dense latents serve functional roles in language model computation. We identify classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction.
arXiv Detail & Related papers (2025-06-18T17:59:35Z)
Transferring Features Across Language Models With Model Stitching [61.24716360332365]
We show that affine mappings between residual streams of language models are a cheap way to transfer represented features between models. We find that small and large models learn similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring to a larger model at a FLOPs savings.
arXiv Detail & Related papers (2025-06-07T01:03:25Z)
Ensembling Sparse Autoencoders [10.81463830315253]
Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. We propose to ensemble multiple SAEs through naive bagging and boosting. Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, diversity of features, and SAE stability.
arXiv Detail & Related papers (2025-05-21T23:31:21Z)
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models [50.587868616659826]
We introduce a comprehensive framework for evaluating monosemanticity at the neuron level in vision representations. Our experimental results reveal that SAEs trained on Vision-Language Models significantly enhance the monosemanticity of individual neurons.
arXiv Detail & Related papers (2025-04-03T17:58:35Z)
Route Sparse Autoencoder to Interpret Large Language Models [33.44362399988847]
Route Sparse Autoencoder (RouteSAE) is a framework that integrates a routing mechanism with a shared SAE to efficiently extract features from multiple layers. Under the same sparsity constraint of 64, RouteSAE extracts 22.5% more features than baseline SAEs while achieving a 22.3% higher interpretability score.
arXiv Detail & Related papers (2025-03-11T09:08:07Z)
Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities. LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands. We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z)
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs free from distractions. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
arXiv Detail & Related papers (2024-12-16T09:47:43Z)
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models. We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features. We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z)
Automatically Interpreting Millions of Features in Large Language Models [1.8035046415192353]
Sparse autoencoders (SAEs) can be used to transform activations into a higher-dimensional latent space. We build an open-source pipeline to generate and evaluate natural language explanations for SAE features. Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons.
arXiv Detail & Related papers (2024-10-17T17:56:01Z)
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models [14.594698598522797]
Demonstrating feature universality allows discoveries about latent representations to generalize across several models. We employ a method known as dictionary learning to transform LLM activations into interpretable spaces spanned by neurons corresponding to individual features. Our experiments reveal significant similarities in SAE feature spaces across various LLMs, providing new evidence for feature universality.
arXiv Detail & Related papers (2024-10-09T15:18:57Z)
LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science. Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z)
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning [0.9374652839580183]
Identifying the features learned by neural networks is a core challenge in mechanistic interpretability. We propose end-to-end sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important. We explore geometric and qualitative differences between e2e SAE features and standard SAE features.
arXiv Detail & Related papers (2024-05-17T17:03:46Z)
Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs [50.5783641817253]
We present a case study of syntax acquisition in masked language models (MLMs). We study Syntactic Attention Structure (SAS), a naturally emerging property of MLMs wherein specific Transformer heads tend to focus on specific syntactic relations. We examine the causal role of SAS by manipulating SAS during training, and demonstrate that SAS is necessary for the development of grammatical capabilities.
arXiv Detail & Related papers (2023-09-13T20:57:11Z)
Self-Supervised Learning for Invariant Representations from Multi-Spectral and SAR Images [5.994412766684843]
Self-supervised learning (SSL) has become the new state of the art in several domain classification and segmentation tasks. This work proposes RSDnet, which applies the distillation network (BYOL) in the remote sensing (RS) domain.
arXiv Detail & Related papers (2022-05-04T13:16:48Z)
Lightweight Single-Image Super-Resolution Network with Attentive Auxiliary Feature Learning [73.75457731689858]
We develop a computationally efficient yet accurate network based on the proposed attentive auxiliary features (A$^2$F) for SISR. Experimental results on a large-scale dataset demonstrate the effectiveness of the proposed model against the state-of-the-art (SOTA) SR methods.
arXiv Detail & Related papers (2020-11-13T06:01:46Z)
Adversarial Feature Hallucination Networks for Few-Shot Learning [84.31660118264514]
Adversarial Feature Hallucination Networks (AFHN) is based on conditional Wasserstein Generative Adversarial Networks (cWGAN). Two novel regularizers are incorporated into AFHN to encourage discriminability and diversity of the synthesized features.
arXiv Detail & Related papers (2020-03-30T02:43:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.