Sparse Attention Post-Training for Mechanistic Interpretability
- URL: http://arxiv.org/abs/2512.05865v1
- Date: Fri, 05 Dec 2025 16:40:08 GMT
- Title: Sparse Attention Post-Training for Mechanistic Interpretability
- Authors: Florent Draye, Anson Lei, Ingmar Posner, Bernhard Schölkopf
- Abstract summary: We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.3\%$ of its edges.
- Score: 55.030850996535776
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.3 \%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.
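The abstract describes a constrained-loss objective: minimise an attention-sparsity regulariser subject to the task loss staying at (or below) its original pretraining value. The paper does not spell out the exact regulariser here, so the sketch below is an illustrative assumption: an entropy penalty as the sparsity surrogate, a Lagrange-multiplier (dual) update to enforce the loss budget, and a threshold-based edge-density measure like the $\approx 0.3\%$ figure quoted above. The function names (`entropy_penalty`, `dual_step`, `edge_density`) are hypothetical, not from the paper.

```python
import numpy as np

def attention(scores):
    """Row-wise softmax over raw attention scores."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy_penalty(attn, eps=1e-12):
    """Illustrative sparsity surrogate: a row that spreads mass over few
    edges has low entropy, so minimising entropy pushes attention sparse."""
    return float(-(attn * np.log(attn + eps)).sum(axis=-1).mean())

def edge_density(attn, thresh=1e-3):
    """Fraction of attention edges whose weight exceeds `thresh` --
    the quantity the abstract reports as ~0.3% after post-training."""
    return float((attn > thresh).mean())

def dual_step(lmbda, task_loss, loss_budget, lr=0.1):
    """Lagrange-multiplier update for the constrained objective
    min sparsity s.t. task_loss <= loss_budget: the multiplier grows
    when the constraint is violated and decays toward zero otherwise."""
    return max(0.0, lmbda + lr * (task_loss - loss_budget))

# Peaked scores concentrate attention on one edge; uniform scores do not.
peaked = attention(np.array([[10.0, 0.0, 0.0, 0.0]]))
uniform = attention(np.zeros((1, 4)))
```

In a training loop, the primal gradient would combine the sparsity penalty with `lmbda` times the task loss, while `dual_step` keeps the model pinned to its original pretraining loss; this is a common pattern for constrained objectives, not necessarily the paper's exact optimiser.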
Related papers
- Bridging Training and Merging Through Momentum-Aware Optimization [8.035521056416242]
Training large neural networks and merging task-specific models both require parameter importance estimation. Current methods compute curvature information during training, discard it, then recompute similar information for merging. We introduce a unified framework that factorizes momentum and curvature statistics during training and reuses them for merging.
arXiv Detail & Related papers (2025-12-18T22:37:33Z) - Repulsor: Accelerating Generative Modeling with a Contrastive Memory Bank [65.00301565190824]
Repulsor is a plug-and-play training framework that requires no external encoders. It achieves a state-of-the-art FID of 2.40 within 400k steps, significantly outperforming comparable methods.
arXiv Detail & Related papers (2025-12-09T14:39:26Z) - The Unreasonable Effectiveness of Randomized Representations in Online Continual Graph Learning [23.73070470019403]
Catastrophic forgetting is one of the main obstacles for Online Continual Graph Learning (OCGL). We use a fixed, randomly initialized encoder to generate robust and expressive node embeddings by aggregating neighborhood information. By freezing the encoder, we eliminate drift of the representation parameters, a key source of forgetting, obtaining embeddings that are both expressive and stable.
arXiv Detail & Related papers (2025-10-08T09:44:14Z) - RefAM: Attention Magnets for Zero-Shot Referral Segmentation [103.98022860792504]
We introduce a new method that exploits features and attention scores from diffusion transformers for downstream tasks. The key insight is that stop words act as attention magnets. We propose an attention redistribution strategy in which appended stop words partition background activations into smaller clusters.
arXiv Detail & Related papers (2025-09-26T17:59:57Z) - ERIS: An Energy-Guided Feature Disentanglement Framework for Out-of-Distribution Time Series Classification [51.07970070817353]
An ideal time series classification (TSC) model should be able to capture invariant representations. Current methods are largely unguided, lacking the semantic direction required to isolate truly universal features. We propose an end-to-end Energy-Regularized Information for Shift-Robustness (ERIS) framework to enable guided and reliable feature disentanglement.
arXiv Detail & Related papers (2025-08-19T12:13:41Z) - Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain the zero-shot generalization stability of VLMs; the method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z) - Learning a Consensus Sub-Network with Polarization Regularization and One Pass Training [2.895034191799291]
Pruning schemes create extra overhead, either through iterative training and fine-tuning for static pruning or through repeated computation of a dynamic pruning graph. We propose a new parameter pruning strategy for learning a lighter-weight sub-network that minimizes the energy cost while maintaining comparable performance to the fully parameterised network on given downstream tasks. Our results on CIFAR-10, CIFAR-100, and Tiny ImageNet suggest that our scheme can remove 50% of the connections in deep networks with only a 1% reduction in classification accuracy.
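Polarization-style pruning, as this related paper's title suggests, trains gate variables that are pushed apart into a "keep" group near one and a "prune" group near zero, so a single threshold pass removes connections without iterative fine-tuning. The sketch below is a minimal illustration of that idea, not the paper's implementation; the regulariser form and the function names (`polarization_penalty`, `prune`) are assumptions.

```python
import numpy as np

def polarization_penalty(gates, t=1.0):
    """Polarisation-style regulariser (sketch): the L1 term pulls gates
    toward zero while the negative deviation-from-mean term pushes them
    apart, so gates split into near-0 and near-1 groups in one pass."""
    return float(t * np.abs(gates).sum() - np.abs(gates - gates.mean()).sum())

def prune(weights, gates, keep_thresh=0.5):
    """One-pass pruning: zero out connections whose learned gate fell
    below `keep_thresh`; returns the masked weights and pruned fraction."""
    mask = (gates >= keep_thresh).astype(weights.dtype)
    return weights * mask, float(1.0 - mask.mean())
```

Polarized gates score lower under the penalty than uniform mid-valued gates, which is what drives the split during training.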
arXiv Detail & Related papers (2023-02-17T09:37:17Z) - Dynamic Feature Regularized Loss for Weakly Supervised Semantic Segmentation [37.43674181562307]
We propose a new regularized loss which utilizes both shallow and deep features that are dynamically updated.
Our approach achieves new state-of-the-art performances, outperforming other approaches by a significant margin with more than 6% mIoU increase.
arXiv Detail & Related papers (2021-08-03T05:11:00Z) - An Information Bottleneck Approach for Controlling Conciseness in Rationale Extraction [84.49035467829819]
We show that it is possible to better manage this trade-off by optimizing a bound on the Information Bottleneck (IB) objective.
Our fully unsupervised approach jointly learns an explainer that predicts sparse binary masks over sentences, and an end-task predictor that considers only the extracted rationale.
arXiv Detail & Related papers (2020-05-01T23:26:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.