Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders
- URL: http://arxiv.org/abs/2505.21364v1
- Date: Tue, 27 May 2025 15:55:55 GMT
- Title: Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders
- Authors: James Oldfield, Shawn Im, Yixuan Li, Mihalis A. Nicolaou, Ioannis Patras, Grigorios G Chrysos
- Abstract summary: Multilayer perceptrons (MLPs) are an integral part of large language models. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse approximation.
- Score: 32.018429935819235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping--significantly increasing the model's next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights--preserving the original decoders' expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language--opening up a promising new avenue for designing interpretable yet faithful decompositions. Our code is included at: https://github.com/james-oldfield/MxD/.
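As a rough illustration of the layer-level sparsity idea in the abstract, the sketch below shows one possible shape of a sparsely gated mixture of linear decoders in PyTorch. It is a hypothetical simplification, not the paper's implementation (see the linked repository for that): the softmax top-k gate and the shared-matrix-times-per-sublayer-scaling parameterization are assumptions standing in for the tensor factorization described above, and the sublayer count is kept small for readability.

```python
# Hypothetical sketch of a sparsely gated mixture of linear decoders (PyTorch).
# NOT the paper's parameterization: each sublayer's weight is formed here as a
# shared full-rank matrix rescaled by a per-sublayer vector, standing in for the
# tensor factorization used by MxDs. Sublayer count is small for readability.
import torch
import torch.nn as nn

class MixtureOfDecodersSketch(nn.Module):
    def __init__(self, d_in: int, d_out: int, n_sublayers: int, k: int = 4):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_in, n_sublayers)           # layer-level routing
        self.shared = nn.Linear(d_in, d_out, bias=False)   # shared full-rank decoder
        self.scales = nn.Parameter(torch.ones(n_sublayers, d_out))  # per-sublayer modulation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activate only the top-k sublayers per token (layer-level sparsity).
        topv, topi = self.gate(x).topk(self.k, dim=-1)      # (batch, k)
        weights = torch.softmax(topv, dim=-1)
        base = self.shared(x)                               # (batch, d_out)
        # Sublayer n applies diag(scales[n]) @ W_shared, which stays a full-rank
        # linear map as long as scales[n] has no zero entries.
        mods = self.scales[topi]                            # (batch, k, d_out)
        return (weights.unsqueeze(-1) * mods * base.unsqueeze(1)).sum(dim=1)

layer = MixtureOfDecodersSketch(d_in=768, d_out=768, n_sublayers=1024)
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
```

The point of this kind of construction is that gating sparsity only restricts which sublayers fire, not the rank of each sublayer's linear map, which is the faithfulness property the abstract emphasizes.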
Related papers
- Mixture of Experts Made Intrinsically Interpretable [34.36996159677674]
We present MoE-X, a Mixture-of-Experts (MoE) language model designed to be intrinsically interpretable. Our approach is motivated by the observation that, in language models, wider networks with sparse activations are more likely to capture interpretable factors. MoE-X achieves perplexity better than GPT-2, with interpretability surpassing even sparse autoencoder (SAE)-based approaches.
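For contrast with the layer-level approach above, here is a minimal, hypothetical sketch of the kind of sparse Mixture-of-Experts block this summary alludes to; the router, expert width, and top-k value are illustrative assumptions, not details taken from the MoE-X paper.

```python
# Hypothetical sparse Mixture-of-Experts block (details assumed, not taken from
# MoE-X): tokens are routed to k experts, and the ReLU inside each expert keeps
# its hidden activations sparse.
import torch
import torch.nn as nn

class SparseMoEBlock(nn.Module):
    def __init__(self, d_model: int, d_expert: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_expert), nn.ReLU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, d_model)
        topv, topi = self.router(x).topk(self.k, dim=-1)
        gates = torch.softmax(topv, dim=-1)
        out = torch.zeros_like(x)
        # Naive per-token dispatch, kept simple rather than efficient.
        for b in range(x.size(0)):
            for slot in range(self.k):
                e = topi[b, slot].item()
                out[b] += gates[b, slot] * self.experts[e](x[b:b + 1]).squeeze(0)
        return out

block = SparseMoEBlock(d_model=64, d_expert=256, n_experts=8)
print(block(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```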
arXiv Detail & Related papers (2025-03-05T17:40:54Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting. We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding. Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z) - Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
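The iteratively refined parallel decoding mentioned above can be sketched generically as mask-predict-style refinement. In the snippet below, `toy_logits` is a stand-in for a real masked language model (such as the adapted T5), and the sequence length, vocabulary size, and unmasking schedule are arbitrary illustrative choices.

```python
# Model-agnostic sketch of iteratively refined parallel decoding (mask-predict style).
# `toy_logits` stands in for a real GMLM such as an adapted T5; length, vocabulary
# size, and the linear unmasking schedule are arbitrary illustrative choices.
import torch

VOCAB, MASK_ID, LENGTH, STEPS = 100, 0, 8, 4

def toy_logits(tokens: torch.Tensor) -> torch.Tensor:
    """Pretend masked LM: per-position logits over the vocabulary."""
    torch.manual_seed(int(tokens.sum()))   # deterministic toy behaviour
    return torch.randn(tokens.size(0), VOCAB)

tokens = torch.full((LENGTH,), MASK_ID)    # start from an all-mask sequence
for step in range(STEPS):
    probs = torch.softmax(toy_logits(tokens), dim=-1)
    conf, pred = probs.max(dim=-1)         # parallel predictions + confidences
    # Commit the most confident positions, re-mask the rest, then refine again.
    n_keep = LENGTH * (step + 1) // STEPS
    keep = conf.topk(n_keep).indices
    tokens = torch.full((LENGTH,), MASK_ID)
    tokens[keep] = pred[keep]

print(tokens)                              # fully unmasked after the final step
```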
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models [55.45444773200529]
Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination.
Recent work has focused on decoding techniques to improve factuality during inference.
arXiv Detail & Related papers (2024-04-14T19:45:35Z) - GIFD: A Generative Gradient Inversion Method with Feature Domain Optimization [52.55628139825667]
Federated Learning (FL) has emerged as a promising distributed machine learning framework to preserve clients' privacy.
Recent studies find that an attacker can invert the shared gradients and recover sensitive data against an FL system by leveraging pre-trained generative adversarial networks (GAN) as prior knowledge.
We propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the feature domains of the intermediate layers.
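A minimal sketch of the feature-domain gradient-inversion idea follows, assuming a toy victim classifier and a toy two-stage generator in place of a pre-trained GAN; the layer sizes, optimizer, and known-label assumption are simplifications, not the GIFD authors' setup.

```python
# Hypothetical sketch of feature-domain gradient inversion in the spirit of GIFD
# (not the authors' implementation): instead of optimizing a GAN's input latent,
# we optimize an intermediate feature and decode it with the generator's later
# layers. Toy victim model, toy generator, and a known label are assumed.
import torch
import torch.nn as nn

torch.manual_seed(0)

victim = nn.Sequential(nn.Flatten(), nn.Linear(16, 4))   # model whose gradients leak
params = list(victim.parameters())
criterion = nn.CrossEntropyLoss()

# Private sample (unknown to the attacker) and the gradient it shares in FL.
x_private, y_private = torch.randn(1, 1, 4, 4), torch.tensor([2])
target_grads = torch.autograd.grad(criterion(victim(x_private), y_private), params)

# Toy pre-trained "generator", split into an early and a late stage.
g_early = nn.Sequential(nn.Linear(8, 32), nn.ReLU())     # latent -> intermediate feature
g_late = nn.Sequential(nn.Linear(32, 16), nn.Tanh())     # feature -> flattened image

# Feature-domain search: initialise from a random latent, then optimise the
# intermediate feature itself (the "disassembled" search space).
h = g_early(torch.randn(1, 8)).detach().requires_grad_(True)
opt = torch.optim.Adam([h], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    x_rec = g_late(h).view(1, 1, 4, 4)
    rec_grads = torch.autograd.grad(criterion(victim(x_rec), y_private),
                                    params, create_graph=True)
    # Match the reconstruction's gradients to the observed ones.
    grad_loss = sum(((g - t) ** 2).sum() for g, t in zip(rec_grads, target_grads))
    grad_loss.backward()
    opt.step()

print("final gradient-matching loss:", grad_loss.item())
```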
arXiv Detail & Related papers (2023-08-09T04:34:21Z) - Multiscale Invertible Generative Networks for High-Dimensional Bayesian Inference [9.953855915186352]
We propose a Multiscale Invertible Generative Network (MsIGN) to solve high-dimensional Bayesian inference.
MsIGN exploits the low-dimensional nature of the posterior, and generates samples from coarse to fine scale.
On the natural image synthesis task, MsIGN achieves superior performance in bits-per-dimension over baseline models.
arXiv Detail & Related papers (2021-05-12T07:51:47Z) - Shaping Deep Feature Space towards Gaussian Mixture for Visual Classification [74.48695037007306]
We propose a Gaussian mixture (GM) loss function for deep neural networks for visual classification.
With a classification margin and a likelihood regularization, the GM loss facilitates both high classification performance and accurate modeling of the feature distribution.
The proposed model can be implemented easily and efficiently without using extra trainable parameters.
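The GM loss can be sketched as distance-to-class-mean logits with a margin on the true class plus a likelihood regularizer; in the hedged sketch below, the identity covariances, margin, and regularization weight are simplifying assumptions, not the paper's exact formulation or values.

```python
# Hedged sketch of a Gaussian-mixture-style classification loss: logits are negative
# squared distances to learnable class means, with a margin on the true class and a
# likelihood regularizer. Identity covariances, margin, and weights are assumptions.
import torch
import torch.nn.functional as F

def gm_loss(features, labels, means, margin=0.1, lambda_lkd=0.01):
    idx = torch.arange(len(labels))
    # Squared distance of every feature to every class mean: (batch, n_classes).
    d2 = ((features.unsqueeze(1) - means.unsqueeze(0)) ** 2).sum(-1)
    d2_margin = d2.clone()
    d2_margin[idx, labels] *= (1.0 + margin)        # enlarge the true-class distance
    cls = F.cross_entropy(-0.5 * d2_margin, labels) # margin-augmented classification term
    lkd = 0.5 * d2[idx, labels].mean()              # pull features toward their class mean
    return cls + lambda_lkd * lkd

means = torch.randn(10, 64, requires_grad=True)     # learnable class means
feats = torch.randn(32, 64, requires_grad=True)
labels = torch.randint(0, 10, (32,))
print(gm_loss(feats, labels, means))
```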
arXiv Detail & Related papers (2020-11-18T03:32:27Z) - Markov-Lipschitz Deep Learning [37.7499958388076]
A prior constraint, called locally isometric smoothness (LIS), is imposed across layers and encoded into a Markov random field (MRF)-Gibbs distribution.
This leads to the best possible solutions for local geometry preservation and robustness.
Experiments, comparisons, and ablation studies demonstrate significant advantages of MLDL for manifold learning and manifold data generation.
arXiv Detail & Related papers (2020-06-15T09:46:42Z)