Identifying Intervenable and Interpretable Features via Orthogonality Regularization
- URL: http://arxiv.org/abs/2602.04718v1
- Date: Wed, 04 Feb 2026 16:29:14 GMT
- Title: Identifying Intervenable and Interpretable Features via Orthogonality Regularization
- Authors: Moritz Miller, Florent Draye, Bernhard Schölkopf
- Abstract summary: We disentangle the decoder matrix into almost orthogonal features. This reduces interference and superposition between the features, while keeping performance on the target dataset essentially unchanged. Our code is available at https://github.com/mrtzmllr/sae-icm.
- Score: 48.938969291033665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With recent progress on fine-tuning language models around a fixed sparse autoencoder, we disentangle the decoder matrix into almost orthogonal features. This reduces interference and superposition between the features, while keeping performance on the target dataset essentially unchanged. Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition. Further, we find that the distance between embedded feature explanations increases with stricter orthogonality penalty, a desirable property for interpretability. Invoking the $\textit{Independent Causal Mechanisms}$ principle, we argue that orthogonality promotes modular representations amenable to causal intervention. We empirically show that these increasingly orthogonalized features allow for isolated interventions. Our code is available at https://github.com/mrtzmllr/sae-icm.
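For intuition, here is a minimal sketch of one way such a penalty can be written: normalize the decoder's feature directions and penalize the off-diagonal entries of their Gram matrix. The function name, weighting, and matrix shapes below are illustrative assumptions, not taken from the paper's code.

```python
# Minimal sketch (not the authors' code): an orthogonality penalty that
# pushes the columns of an SAE decoder matrix toward mutual orthogonality
# by penalizing off-diagonal entries of the Gram matrix.
import torch

def orthogonality_penalty(decoder: torch.Tensor) -> torch.Tensor:
    """decoder: (d_model, n_features) matrix of feature directions."""
    cols = torch.nn.functional.normalize(decoder, dim=0)  # unit-norm features
    gram = cols.T @ cols                                  # (n_features, n_features)
    off_diag = gram - torch.eye(gram.shape[0])
    return (off_diag ** 2).sum()

# Illustrative use: in practice this term would be added to the usual
# reconstruction + sparsity loss of the sparse autoencoder.
W = torch.randn(512, 2048, requires_grad=True)
loss = orthogonality_penalty(W)
loss.backward()
```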
Related papers
- MirrorLA: Reflecting Feature Map for Vision Linear Attention [49.41670925034762]
Linear attention significantly reduces the computational complexity of Transformers from quadratic to linear, yet it consistently lags behind softmax-based attention in performance. We propose MirrorLA, a geometric framework that substitutes passive truncation with active reorientation. MirrorLA achieves state-of-the-art performance across standard benchmarks, demonstrating that strictly linear efficiency can be achieved without compromising representational fidelity.
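As background, a generic kernelized linear attention is sketched below; MirrorLA's reorientation of the feature map is not reproduced, and the ELU-based feature map is a common choice assumed here for illustration.

```python
# Sketch of generic (kernelized) linear attention; cost is linear in sequence
# length because keys and values are aggregated once into a (dim x dim) state.
import torch

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (batch, seq, dim) tensors."""
    phi = lambda x: torch.nn.functional.elu(x) + 1.0  # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum("bsd,bse->bde", k, v)           # aggregate keys/values
    z = 1.0 / (torch.einsum("bsd,bd->bs", q, k.sum(dim=1)) + eps)
    return torch.einsum("bsd,bde,bs->bse", q, kv, z)  # normalized readout

out = linear_attention(torch.randn(2, 128, 64),
                       torch.randn(2, 128, 64),
                       torch.randn(2, 128, 64))
```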
arXiv Detail & Related papers (2026-02-04T09:14:09Z)
- BLOCK-EM: Preventing Emergent Misalignment by Blocking Causal Features [6.495737609776765]
Emergent misalignment can arise when a language model is fine-tuned on a narrowly scoped supervised objective. We investigate a mechanistic approach to preventing emergent misalignment by identifying a small set of internal features that reliably control the misaligned behavior.
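A hedged sketch of the basic blocking operation such an approach relies on: zeroing a chosen set of sparse-autoencoder feature activations before they are decoded back into the model. The interface and feature indices below are hypothetical, not BLOCK-EM's API.

```python
# Illustrative only: "block" causal features by zeroing their activations
# in a sparse-autoencoder latent before decoding.
import torch

def block_features(latents: torch.Tensor, blocked: list[int]) -> torch.Tensor:
    """latents: (batch, n_features) SAE activations; returns an ablated copy."""
    out = latents.clone()
    out[:, blocked] = 0.0
    return out

acts = torch.relu(torch.randn(4, 1024))   # stand-in for SAE encoder output
patched = block_features(acts, blocked=[3, 17, 42])
```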
arXiv Detail & Related papers (2026-01-31T15:11:05Z)
- Zonkey: A Hierarchical Diffusion Language Model with Differentiable Tokenization and Probabilistic Attention [0.0]
Zonkey is a hierarchical diffusion model that addresses the limitations of fixed tokenization through a fully trainable pipeline from raw characters to document-level representations. At its core is a differentiable tokenizer that learns probabilistic beginning-of-sequence (BOS) decisions. Zonkey generates coherent, variable-length text from noise, demonstrating emergent hierarchies.
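Speculatively, a probabilistic boundary decision can be made differentiable with a straight-through estimator, as in the sketch below; this illustrates the general mechanism only and assumes nothing about Zonkey's actual tokenizer.

```python
# Speculative sketch: per-character boundary ("BOS") probabilities via a
# sigmoid, with a straight-through estimator for hard 0/1 decisions that
# still pass gradients. Not Zonkey's implementation.
import torch

def boundary_decisions(char_logits: torch.Tensor) -> torch.Tensor:
    """char_logits: (seq,) one logit per character position."""
    p = torch.sigmoid(char_logits)     # soft boundary probabilities
    hard = (p > 0.5).float()           # hard BOS decisions
    return hard + p - p.detach()       # straight-through gradient

decisions = boundary_decisions(torch.randn(32, requires_grad=True))
```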
arXiv Detail & Related papers (2026-01-29T14:17:37Z)
- Sculpting Latent Spaces With MMD: Disentanglement With Programmable Priors [30.182736043604304]
We introduce the Programmable Prior Framework, a method built on the Maximum Mean Discrepancy (MMD). Our work provides a foundational tool for representation engineering, opening new avenues for model identifiability and causal reasoning.
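For reference, the standard biased RBF-kernel estimator of squared MMD between two samples looks as follows; how the framework composes such terms into a programmable prior is not shown here.

```python
# Standard (biased) RBF-kernel MMD^2 estimator between samples x and y.
import torch

def mmd2_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """x: (n, d), y: (m, d). Returns a biased estimate of MMD^2."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

gap = mmd2_rbf(torch.randn(100, 8), torch.randn(120, 8))
```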
arXiv Detail & Related papers (2025-10-13T21:26:01Z)
- Streaming Private Continual Counting via Binning [11.72102598708538]
We present a simple approach to approximating factorization mechanisms in low space via $\textit{binning}$. We show empirically that even with very low space usage we are able to closely match, and sometimes surpass, the performance of optimal factorization mechanisms.
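A rough sketch of the binning idea, under the assumption of geometrically growing bins and mean-pooled coefficients (both illustrative choices, not the paper's construction): store one value per bin so a length-T coefficient sequence needs only about log2(T) numbers.

```python
# Hedged sketch: compress T Toeplitz coefficients into bins of doubling
# width, keeping one shared value per bin, for O(log T) space.
import numpy as np

def binned_coefficients(coeffs: np.ndarray) -> list[tuple[int, int, float]]:
    """Return (start, end, value) bins approximating coeffs[0..T-1]."""
    bins, start, width = [], 0, 1
    while start < len(coeffs):
        end = min(start + width, len(coeffs))
        bins.append((start, end, float(coeffs[start:end].mean())))
        start, width = end, width * 2
    return bins

T = 1024
bins = binned_coefficients(1.0 / np.sqrt(np.arange(1, T + 1)))
print(len(bins))  # ~log2(T) + 1 stored values instead of T
```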
arXiv Detail & Related papers (2024-12-10T01:21:56Z)
- S-CFE: Simple Counterfactual Explanations [22.262567049579648]
We tackle the problem of finding manifold-aligned counterfactual explanations for sparse data. Our approach effectively produces sparse, manifold-aligned counterfactual explanations.
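A generic sketch of sparse counterfactual search (gradient steps toward the target class plus an L1 penalty on the edit); the manifold-alignment term that is the paper's focus is omitted, and all names are illustrative.

```python
# Illustrative sparse counterfactual search: optimize a perturbation delta
# so the model predicts the target class while keeping delta sparse (L1).
import torch

def sparse_counterfactual(model, x, target, steps=200, lr=0.05, l1=0.1):
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(
            model(x + delta), target) + l1 * delta.abs().sum()
        loss.backward()
        opt.step()
    return x + delta.detach()

model = torch.nn.Linear(10, 3)                 # stand-in classifier
x_cf = sparse_counterfactual(model, torch.randn(1, 10), torch.tensor([2]))
```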
arXiv Detail & Related papers (2024-10-21T07:42:43Z)
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models [55.19497659895122]
We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors.
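At its simplest, the causal role of a feature can be probed by comparing a model metric with the feature active versus ablated, as sketched below; the paper's attribution methods for assembling full circuits are more involved.

```python
# Illustrative sketch (not the paper's method): a feature's causal effect on
# a scalar model metric, estimated by ablating that feature and diffing.
import torch

def feature_effect(metric, latents: torch.Tensor, idx: int) -> torch.Tensor:
    """metric: maps SAE latents to a scalar per example; idx: feature index."""
    ablated = latents.clone()
    ablated[:, idx] = 0.0
    return metric(latents) - metric(ablated)   # per-example causal effect

metric = lambda z: z.sum(dim=1)                # stand-in for metric-through-model
effect = feature_effect(metric, torch.relu(torch.randn(8, 256)), idx=7)
```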
arXiv Detail & Related papers (2024-03-28T17:56:07Z)
- On the Stability of Expressive Positional Encodings for Graphs [46.967035678550594]
Using Laplacian eigenvectors as positional encodings faces two fundamental challenges. We introduce Stable and Expressive Positional Encodings (SPE). SPE is the first architecture that is (1) provably stable, and (2) universally expressive for basis invariant functions.
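For context, standard Laplacian-eigenvector positional encodings are computed as below (assuming a graph with no isolated nodes); SPE's stable, basis-invariant processing of these vectors is the paper's contribution and is not reproduced.

```python
# Standard Laplacian positional encodings: the k smallest eigenvectors of
# the symmetric normalized graph Laplacian (assumes no isolated nodes).
import numpy as np

def laplacian_pe(adj: np.ndarray, k: int) -> np.ndarray:
    """adj: (n, n) symmetric adjacency matrix; returns (n, k) encodings."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = deg ** -0.5
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)     # ascending eigenvalues
    return eigvecs[:, :k]

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
pe = laplacian_pe(A, k=2)
```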
arXiv Detail & Related papers (2023-10-04T04:48:55Z)
- Sparse Quadratic Optimisation over the Stiefel Manifold with Application to Permutation Synchronisation [71.27989298860481]
We address the non-convex optimisation problem of finding a matrix on the Stiefel manifold that maximises a quadratic objective function.
We propose a simple yet effective sparsity-promoting algorithm for finding the dominant eigenspace matrix.
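One standard ingredient of such algorithms, shown as a hedged sketch: retracting an arbitrary matrix onto the Stiefel manifold via the polar factor of its SVD. The paper's sparsity-promoting iteration itself is not reproduced.

```python
# Nearest matrix with orthonormal columns (a point on the Stiefel manifold),
# obtained from the polar factor U @ Vt of the SVD.
import numpy as np

def stiefel_projection(x: np.ndarray) -> np.ndarray:
    u, _, vt = np.linalg.svd(x, full_matrices=False)
    return u @ vt

X = stiefel_projection(np.random.randn(20, 4))
print(np.allclose(X.T @ X, np.eye(4)))  # True: columns are orthonormal
```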
arXiv Detail & Related papers (2021-09-30T19:17:35Z)
- Can contrastive learning avoid shortcut solutions? [88.249082564465]
Implicit feature modification (IFM) is a method for altering positive and negative samples in order to guide contrastive models towards capturing a wider variety of predictive features.
IFM reduces feature suppression, and as a result improves performance on vision and medical imaging tasks.
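A sketch of the perturbation idea behind IFM: move the positive embedding away from the anchor and the negatives toward it by a small budget before computing InfoNCE. Step sizes and normalization details below are assumptions, not the paper's exact recipe.

```python
# Hedged sketch: InfoNCE with adversarial-style perturbations that make the
# positive harder and the negatives easier, discouraging shortcut features.
import torch
import torch.nn.functional as F

def ifm_infonce(anchor, pos, negs, eps=0.1, tau=0.1):
    """anchor, pos: (d,); negs: (k, d) unit-normalized embeddings."""
    pos = F.normalize(pos - eps * anchor, dim=0)    # reduce positive similarity
    negs = F.normalize(negs + eps * anchor, dim=1)  # raise negative similarity
    logits = torch.cat([(anchor @ pos).view(1), negs @ anchor]) / tau
    return -F.log_softmax(logits, dim=0)[0]         # positive is class 0

loss = ifm_infonce(F.normalize(torch.randn(64), dim=0),
                   F.normalize(torch.randn(64), dim=0),
                   F.normalize(torch.randn(16, 64), dim=1))
```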
arXiv Detail & Related papers (2021-06-21T16:22:43Z)
- Discrete Variational Attention Models for Language Generation [51.88612022940496]
We propose a discrete variational attention model with a categorical distribution over the attention mechanism, owing to the discrete nature of language.
Thanks to the property of discreteness, the training of our proposed approach does not suffer from posterior collapse.
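A generic illustration of categorical attention trained with a straight-through Gumbel-softmax sample (a common estimator assumed here; the paper's exact variational treatment may differ):

```python
# Sketch of discrete attention: sample a hard one-hot attention choice with
# straight-through Gumbel-softmax instead of a soft softmax average.
import torch
import torch.nn.functional as F

def discrete_attention(scores: torch.Tensor, values: torch.Tensor, tau=1.0):
    """scores: (batch, seq) attention logits; values: (batch, seq, dim)."""
    one_hot = F.gumbel_softmax(scores, tau=tau, hard=True)  # (batch, seq)
    return torch.einsum("bs,bsd->bd", one_hot, values)      # pick one value

ctx = discrete_attention(torch.randn(2, 10), torch.randn(2, 10, 32))
```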
arXiv Detail & Related papers (2020-04-21T05:49:04Z)