Beyond Activation Patterns: A Weight-Based Out-of-Context Explanation of Sparse Autoencoder Features
- URL: http://arxiv.org/abs/2601.22447v1
- Date: Fri, 30 Jan 2026 01:30:48 GMT
- Title: Beyond Activation Patterns: A Weight-Based Out-of-Context Explanation of Sparse Autoencoder Features
- Authors: Yiting Liu, Zhi-Hong Deng
- Abstract summary: Current interpretation methods infer feature semantics from activation patterns, but overlook that features are trained to reconstruct activations that serve computational roles in the forward pass. We introduce a novel weight-based interpretation framework that measures functional effects through direct weight interactions, requiring no activation data.
- Score: 11.463277740376236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse autoencoders (SAEs) have emerged as a powerful technique for decomposing language model representations into interpretable features. Current interpretation methods infer feature semantics from activation patterns, but overlook that features are trained to reconstruct activations that serve computational roles in the forward pass. We introduce a novel weight-based interpretation framework that measures functional effects through direct weight interactions, requiring no activation data. Through three experiments on Gemma-2 and Llama-3.1 models, we demonstrate that (1) 1/4 of features directly predict output tokens, (2) features actively participate in attention mechanisms with depth-dependent structure, and (3) semantic and non-semantic feature populations exhibit distinct distribution profiles in attention circuits. Our analysis provides the missing out-of-context half of SAE feature interpretability.
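A minimal sketch of the weight-based reading of a feature, assuming a standard SAE setup in which each feature has a decoder direction living in the residual stream: projecting that direction through the model's unembedding matrix surfaces the output tokens the feature most directly promotes, with no activation data involved. The names `W_dec`, `W_U`, and the usage line are illustrative assumptions, not the paper's code.

```python
import torch

def feature_output_tokens(W_dec, W_U, tokenizer, feature_id, top_k=10):
    """Project one SAE feature's decoder direction through the unembedding.

    W_dec: (num_features, d_model) SAE decoder matrix (residual-stream basis).
    W_U:   (d_model, vocab_size) unembedding matrix of the language model.
    Returns the top_k tokens whose logits this feature most strongly promotes,
    an out-of-context, activation-free reading of the feature.
    """
    direction = W_dec[feature_id]            # (d_model,)
    logit_effect = direction @ W_U           # (vocab_size,)
    top = torch.topk(logit_effect, top_k)
    return [(tokenizer.decode([int(i)]), float(v)) for i, v in zip(top.indices, top.values)]

# Hypothetical TransformerLens-style usage (names are assumptions, not the paper's code):
# print(feature_output_tokens(sae.W_dec, model.W_U, model.tokenizer, feature_id=1234))
```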
Related papers
- Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features [1.5874067490843806]
Control Reinforcement Learning trains a policy to select SAE features for steering at each token, producing interpretable intervention logs. Adaptive Feature Masking encourages diverse feature discovery while preserving single-feature interpretability. On Gemma 2 2B across MMLU, BBQ, GSM8K, HarmBench, and XSTest, CRL achieves improvements while providing per-token intervention logs.
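As a rough illustration of what per-token SAE steering with an intervention log can look like (not the CRL implementation, whose selection policy is learned with reinforcement learning), the hook below adds a chosen feature's decoder direction to the residual stream and records each intervention:

```python
import torch

def make_steering_hook(W_dec, select_feature, alpha=4.0, log=None):
    """Forward hook that steers the residual stream with SAE features.

    select_feature(resid_vec) -> feature_id or None stands in for a learned
    per-token selection policy; any heuristic works for illustration.
    Each applied intervention is appended to `log`.
    """
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        steered = resid.clone()
        for b in range(resid.shape[0]):
            for t in range(resid.shape[1]):
                fid = select_feature(resid[b, t])
                if fid is not None:
                    steered[b, t] = resid[b, t] + alpha * W_dec[fid]
                    if log is not None:
                        log.append({"batch": b, "pos": t, "feature": int(fid), "alpha": alpha})
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Hypothetical usage on a HuggingFace-style decoder layer (names are assumptions):
# log = []
# handle = model.model.layers[12].register_forward_hook(
#     make_steering_hook(sae.W_dec, policy, alpha=4.0, log=log))
```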
arXiv Detail & Related papers (2026-02-11T02:28:49Z)
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders [8.188989044347595]
We propose a Sparse Autoencoder-based framework for retrieving and steering semantically interpretable internal features. Using the Big Five personality traits as a case study, we demonstrate that our method enables precise, bidirectional steering of model behavior.
arXiv Detail & Related papers (2026-01-06T12:40:37Z)
- Circuit Insights: Towards Interpretability Beyond Activations [20.178085579725472]
We propose WeightLens and CircuitLens, two complementary methods for mechanistic interpretability. WeightLens interprets features directly from their learned weights, removing the need for explainer models or datasets. CircuitLens captures how feature activations arise from interactions between components, revealing circuit-level dynamics.
arXiv Detail & Related papers (2025-10-16T17:49:41Z)
- Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition [51.26760289602137]
We show that a function induction mechanism explains the model's generalization from standard addition to off-by-one addition. This mechanism resembles the structure of the induction head mechanism found in prior work and elevates it to a higher level of abstraction. We find that this function induction mechanism is reused in a broader range of tasks, including synthetic tasks such as shifted multiple-choice QA and algorithmic tasks such as base-8 addition.
arXiv Detail & Related papers (2025-07-14T03:20:55Z)
- Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization [17.101290138120564]
Current methods rely on dictionary learning with sparse autoencoders (SAEs). Here, we tackle these limitations by directly decomposing activations with semi-nonnegative matrix factorization (SNMF). Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF-derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering.
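For intuition only, a toy semi-NMF: the basis (the learned directions) is left unconstrained so it can fit mixed-sign MLP activations, while the per-example coefficients are forced to be nonnegative. The alternating least-squares-with-clipping below is a crude sketch, not the multiplicative updates used in the SNMF literature or in the paper.

```python
import numpy as np

def semi_nmf(X, k, n_iter=200, seed=0):
    """Toy semi-NMF: X (d x n) ~= F (d x k) @ G (k x n), with G >= 0.

    F (the learned directions) may have mixed signs, so it can represent raw
    MLP activations; only the coefficients G are nonnegative. Crude
    alternating least squares with clipping -- a sketch, not the reference
    SNMF algorithm.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    G = np.abs(rng.standard_normal((k, n)))
    for _ in range(n_iter):
        # Basis update: F = argmin_F ||X - F @ G||_F^2  (unconstrained)
        F = np.linalg.lstsq(G.T, X.T, rcond=None)[0].T        # (d, k)
        # Coefficient update: least squares, then project onto G >= 0
        G = np.clip(np.linalg.lstsq(F, X, rcond=None)[0], 0.0, None)
    return F, G

# e.g. F, G = semi_nmf(mlp_activations, k=64)  # columns of X are examples
```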
arXiv Detail & Related papers (2025-06-12T17:33:29Z)
- Analyze Feature Flow to Enhance Interpretation and Steering in Language Models [3.8498574327875947]
We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage.
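The data-free matching step can be sketched as plain cosine similarity between SAE decoder directions in adjacent layers; the per-layer `W_dec` matrices and the threshold below are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def match_features_across_layers(W_dec_l, W_dec_next, threshold=0.7):
    """Data-free matching of SAE features between consecutive layers.

    W_dec_l, W_dec_next: (num_features, d_model) decoder matrices of the
    SAEs trained on layer l and layer l+1. For each layer-l feature, find
    the most cosine-similar feature in layer l+1; matches below the
    threshold are reported as None (the feature transforms or disappears).
    """
    a = F.normalize(W_dec_l, dim=-1)
    b = F.normalize(W_dec_next, dim=-1)
    sims = a @ b.T                     # (n_l, n_next) cosine similarities
    best_sim, best_idx = sims.max(dim=-1)
    return [
        (i, int(best_idx[i]) if best_sim[i] >= threshold else None, float(best_sim[i]))
        for i in range(a.shape[0])
    ]
```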
arXiv Detail & Related papers (2025-02-05T09:39:34Z)
- X2-DFD: A framework for eXplainable and eXtendable Deepfake Detection [55.77552681618732]
X2-DFD is an eXplainable and eXtendable framework based on multimodal large-language models (MLLMs) for deepfake detection. The first stage, Model Feature Assessment, systematically evaluates the detectability of forgery-related features for the MLLM. The second stage, Explainable Dataset Construction, consists of two key modules: Strong Feature Strengthening and Weak Feature Supplementing. The third stage, Fine-tuning and Inference, involves fine-tuning the MLLM on the constructed dataset and deploying it for final detection and explanation.
arXiv Detail & Related papers (2024-10-08T15:28:33Z)
- Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI have made it possible to mitigate limitations by leveraging improved explanations for Transformers.
We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models.
Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z)
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models [55.19497659895122]
We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors.
arXiv Detail & Related papers (2024-03-28T17:56:07Z)
- Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach [80.8446673089281]
We propose a new learning paradigm with graph representation and learning.
Our framework contains two modules: 1) a backbone network (e.g., feedforward neural nets) as a lower model takes features as input and outputs predicted labels; 2) a graph neural network as an upper model learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
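As a toy illustration of the feature-data graph idea (under strong simplifying assumptions, and not the trained GNN described in the paper), a new feature node can be linked to the samples where it is observed to be nonzero, and its embedding obtained by one hand-written mean-aggregation step over those data-node embeddings:

```python
import numpy as np

def extrapolate_feature_embedding(new_feature_values, data_embeddings):
    """One hand-coded message-passing step on a feature-data graph.

    new_feature_values: (n_samples,) values of a previously unseen feature.
    data_embeddings:    (n_samples, d) embeddings of the data nodes, e.g.
                        hidden representations from the backbone network.

    The new feature node is connected to every sample where its value is
    nonzero; its embedding is the mean of the connected data-node
    embeddings. A trained GNN would learn this aggregation instead.
    """
    neighbors = np.nonzero(new_feature_values)[0]
    if neighbors.size == 0:
        return np.zeros(data_embeddings.shape[1])
    return data_embeddings[neighbors].mean(axis=0)
```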
arXiv Detail & Related papers (2021-10-09T09:02:45Z)
- Transforming Feature Space to Interpret Machine Learning Models [91.62936410696409]
This contribution proposes a novel approach that interprets machine-learning models through the lens of feature space transformations.
It can be used to enhance unconditional as well as conditional post-hoc diagnostic tools.
A case study on remote-sensing landcover classification with 46 features is used to demonstrate the potential of the proposed approach.
arXiv Detail & Related papers (2021-04-09T10:48:11Z)