Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs
- URL: http://arxiv.org/abs/2505.20254v1
- Date: Mon, 26 May 2025 17:31:36 GMT
- Title: Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs
- Authors: Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, Kun Zhang
- Abstract summary: This paper argues that mechanistic interpretability should prioritize feature consistency in SAEs. We propose using the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric to operationalize consistency.
- Score: 34.52554840674882
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse Autoencoders (SAEs) are a prominent tool in mechanistic interpretability (MI) for decomposing neural network activations into interpretable features. However, the aspiration to identify a canonical set of features is challenged by the observed inconsistency of learned SAE features across different training runs, undermining the reliability and efficiency of MI research. This position paper argues that mechanistic interpretability should prioritize feature consistency in SAEs -- the reliable convergence to equivalent feature sets across independent runs. We propose using the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric to operationalize consistency and demonstrate that high levels are achievable (0.80 for TopK SAEs on LLM activations) with appropriate architectural choices. Our contributions include detailing the benefits of prioritizing consistency; providing theoretical grounding and synthetic validation using a model organism, which verifies PW-MCC as a reliable proxy for ground-truth recovery; and extending these findings to real-world LLM data, where high feature consistency strongly correlates with the semantic similarity of learned feature explanations. We call for a community-wide shift towards systematically measuring feature consistency to foster robust cumulative progress in MI.
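The abstract names PW-MCC but does not spell out its computation. A minimal sketch of one plausible operationalization, assuming each SAE dictionary is a matrix of decoder atoms (one row per feature) and that features are matched across runs by maximizing total absolute cosine similarity; the function name `pw_mcc` and this matching scheme are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pw_mcc(D1: np.ndarray, D2: np.ndarray) -> float:
    """Mean correlation between optimally matched dictionary atoms.

    D1, D2: (n_features, d_model) decoder dictionaries from two
    independently trained SAE runs on the same activations.
    """
    # Normalize atoms so inner products are cosine similarities.
    D1 = D1 / np.linalg.norm(D1, axis=1, keepdims=True)
    D2 = D2 / np.linalg.norm(D2, axis=1, keepdims=True)
    sim = np.abs(D1 @ D2.T)                   # |cosine| similarity matrix
    rows, cols = linear_sum_assignment(-sim)  # Hungarian matching (maximize)
    return float(sim[rows, cols].mean())
```

Under this reading, two runs that recover the same features up to permutation and sign give a score of 1.0, while unrelated dictionaries score near zero; the paper's reported 0.80 for TopK SAEs would sit between these extremes.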
Related papers
- Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders [50.52694757593443]
Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations. We first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability. We introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity.
arXiv Detail & Related papers (2025-06-16T20:58:05Z) - NDCG-Consistent Softmax Approximation with Accelerated Convergence [67.10365329542365]
We propose novel loss formulations that align directly with ranking metrics. We integrate the proposed RG losses with the highly efficient Alternating Least Squares (ALS) optimization method. Empirical evaluations on real-world datasets demonstrate that our approach achieves comparable or superior ranking performance.
arXiv Detail & Related papers (2025-06-11T06:59:17Z) - Unveil Sources of Uncertainty: Feature Contribution to Conformal Prediction Intervals [0.3495246564946556]
We propose a novel, model-agnostic uncertainty attribution (UA) method grounded in conformal prediction (CP). By defining cooperative games where CP interval properties, such as width and bounds, serve as value functions, we attribute predictive uncertainty to input features. Our experiments on synthetic benchmarks and real-world datasets demonstrate the practical utility and interpretative depth of our approach.
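For readers unfamiliar with the CP intervals whose width serves as a value function above, a minimal split-conformal sketch of how such an interval is formed; this is the standard technique, not the paper's attribution method, and the helper name and quantile convention are illustrative assumptions:

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred, alpha=0.1):
    """Split conformal prediction for regression.

    residuals_cal: calibration-set residuals (y_true - y_hat).
    Returns a (lo, hi) interval around y_pred with ~(1 - alpha) coverage;
    the interval width is the quantity a CP-based game could score.
    """
    n = len(residuals_cal)
    # Finite-sample-corrected quantile of absolute calibration residuals.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(np.abs(residuals_cal), level, method="higher")
    return y_pred - q, y_pred + q
```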
arXiv Detail & Related papers (2025-05-19T13:49:05Z) - SMARTe: Slot-based Method for Accountable Relational Triple extraction [1.2200609701777907]
Relational Triple Extraction (RTE) is a fundamental task in Natural Language Processing (NLP). We propose SMARTe: a Slot-based Method for Accountable Relational Triple extraction. We demonstrate that adding interpretability does not compromise performance.
arXiv Detail & Related papers (2025-04-17T10:21:15Z) - Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z) - Similarity-Distance-Magnitude Universal Verification [0.0]
We build SDM networks with uncertainty-aware verification and interpretability-by-exemplar as intrinsic properties. We provide open-source software implementing these results.
arXiv Detail & Related papers (2025-02-27T15:05:00Z) - FeaKM: Robust Collaborative Perception under Noisy Pose Conditions [1.9626657740463982]
We introduce FeaKM, a novel method that employs Feature-level Keypoints Matching to correct pose discrepancies among collaborating agents. Our experimental results demonstrate that FeaKM significantly outperforms existing methods on the DAIR-V2X dataset.
arXiv Detail & Related papers (2025-02-16T06:03:33Z) - Are Your LLMs Capable of Stable Reasoning? [38.03049704515947]
We introduce G-Pass@$k$, a novel evaluation metric that continuously assesses model performance across multiple sampling attempts. We employ G-Pass@$k$ in conjunction with state-of-the-art large language models to provide comprehensive insights into their potential capabilities and operational consistency.
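The blurb leaves the metric abstract. Under one plausible reading, G-Pass@$k$ generalizes pass@k by requiring that at least a fraction $\tau$ of $k$ samples, drawn without replacement from $n$ generations of which $c$ are correct, be correct. The hypergeometric sketch below encodes that reading; it is an assumption about the definition, not the paper's exact formula:

```python
from math import ceil, comb

def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """P(at least ceil(tau * k) correct among k samples drawn without
    replacement from n generations, c of which are correct).

    With tau small enough that the threshold is 1, this reduces to
    the standard unbiased pass@k estimator.
    """
    m = max(ceil(tau * k), 1)          # required number of correct samples
    total = comb(n, k)
    # Hypergeometric upper tail: sum over feasible correct counts j >= m.
    hits = sum(comb(c, j) * comb(n - c, k - j)
               for j in range(m, min(c, k) + 1))
    return hits / total
```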
arXiv Detail & Related papers (2024-12-17T18:12:47Z) - TRACE: TRansformer-based Attribution using Contrastive Embeddings in LLMs [50.259001311894295]
We propose a novel TRansformer-based Attribution framework using Contrastive Embeddings called TRACE.
We show that TRACE significantly improves the ability to attribute sources accurately, making it a valuable tool for enhancing the reliability and trustworthiness of large language models.
arXiv Detail & Related papers (2024-07-06T07:19:30Z) - Federated Contrastive Learning for Personalized Semantic Communication [55.46383524190467]
We design a federated contrastive learning framework aimed at supporting personalized semantic communication.
FedCL enables collaborative training of local semantic encoders across multiple clients and a global semantic decoder owned by the base station.
To tackle the semantic imbalance issue arising from heterogeneous datasets across distributed clients, we employ contrastive learning to train a semantic centroid generator.
arXiv Detail & Related papers (2024-06-13T14:45:35Z) - Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z) - Robust Learning Through Cross-Task Consistency [92.42534246652062]
We propose a broadly applicable and fully computational method for augmenting learning with Cross-Task Consistency.
We observe that learning with cross-task consistency leads to more accurate predictions and better generalization to out-of-distribution inputs.
arXiv Detail & Related papers (2020-06-07T09:24:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.