Equivalent Linear Mappings of Large Language Models
- URL: http://arxiv.org/abs/2505.24293v3
- Date: Sat, 11 Oct 2025 04:29:51 GMT
- Title: Equivalent Linear Mappings of Large Language Models
- Authors: James R. Golden
- Abstract summary: We exploit a property of transformers whereby every operation can be expressed as $A(x) \cdot x$, where $A(x)$ represents an input-dependent linear transform and $x$ preserves the linear pathway. To expose this linear structure, we strategically detach components of the gradient with respect to an input sequence, freezing the $A(x)$ terms at their values computed during inference. This detached Jacobian reconstructs the output with one linear operator per input token, which is shown for Qwen 3, Gemma 3 and Llama 3, up to Qwen 3 14B.
- Score: 0.5076419064097734
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite significant progress in transformer interpretability, an understanding of the computational mechanisms of large language models (LLMs) remains a fundamental challenge. Many approaches interpret a network's hidden representations but remain agnostic about how those representations are generated. We address this by mapping LLM inference for a given input sequence to an equivalent and interpretable linear system which reconstructs the predicted output embedding with relative error below $10^{-13}$ at double floating-point precision, requiring no additional model training. We exploit a property of transformers wherein every operation (gated activations, attention, and normalization) can be expressed as $A(x) \cdot x$, where $A(x)$ represents an input-dependent linear transform and $x$ preserves the linear pathway. To expose this linear structure, we strategically detach components of the gradient computation with respect to an input sequence, freezing the $A(x)$ terms at their values computed during inference, such that the Jacobian yields an equivalent linear mapping. This detached Jacobian of the model reconstructs the output with one linear operator per input token, which is shown for Qwen 3, Gemma 3 and Llama 3, up to Qwen 3 14B. These linear representations demonstrate that LLMs operate in extremely low-dimensional subspaces where the singular vectors can be decoded to interpretable semantic concepts. The computation for each intermediate output also has a linear equivalent, and we examine how the linear representations of individual layers and their attention and multilayer perceptron modules build predictions, and use these as steering operators to insert semantic concepts into unrelated text. Despite their global nonlinearity, LLMs can be interpreted through equivalent linear representations that reveal low-dimensional semantic structures in the next-token prediction process.
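To make the detached-Jacobian construction concrete, the following is a minimal sketch (not the authors' released code) of the mechanism on a single SwiGLU-style gated MLP: the gate activation is frozen with `.detach()` so the remaining computation is linear in the input, and the Jacobian of that detached function is an equivalent linear operator whose product with the input reproduces the nonlinear output to near machine precision. The toy dimensions, random weights, and use of `torch.func.jacrev` are illustrative assumptions.

```python
# Minimal sketch (assumed toy setup, not the paper's code): freeze the
# input-dependent gain A(x) of a SwiGLU layer via .detach() so that the
# Jacobian of the detached function is an exact equivalent linear map.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ff = 8, 16
W_gate = torch.randn(d_ff, d_model, dtype=torch.float64)
W_up   = torch.randn(d_ff, d_model, dtype=torch.float64)
W_down = torch.randn(d_model, d_ff, dtype=torch.float64)
x = torch.randn(d_model, dtype=torch.float64)

def swiglu(x):
    # Standard gated MLP: nonlinear in x through the SiLU gate.
    return W_down @ (F.silu(W_gate @ x) * (W_up @ x))

def swiglu_detached(x):
    # Freeze the gate at its inference-time value: A(x) = diag(silu(W_gate x))
    # is treated as a constant, leaving a computation that is linear in x.
    gate = F.silu(W_gate @ x).detach()
    return W_down @ (gate * (W_up @ x))

y = swiglu(x)                                # ordinary nonlinear forward pass
J = torch.func.jacrev(swiglu_detached)(x)    # the "detached Jacobian"

# The detached function is linear and homogeneous in x, so J @ x reproduces
# the original nonlinear output; the relative error is near machine epsilon.
print((torch.norm(J @ x - y) / torch.norm(y)).item())
```

Per the abstract, the same detaching pattern extends to attention weights and normalization denominators, yielding one equivalent linear operator per input token for the full model.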
Related papers
- Task Matrices: Linear Maps for Cross-Model Finetuning Transfer [4.243926243206826]
We develop the concept of a task matrix, a linear transformation from a base to a finetuned embedding state. We show that for vision and text models and ten different datasets, a base model augmented with a task matrix achieves results surpassing linear probes. Our results validate the existence of cross-layer linear encodings between pretrained and finetuned architectures.
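A minimal sketch of the task-matrix idea just summarized, using synthetic embeddings in place of real base and finetuned model outputs (an assumption for illustration, not the paper's procedure): a single linear map is fit by least squares and then applied to unseen base embeddings.

```python
# Hedged sketch with synthetic data: fit a linear "task matrix" T so that
# finetuned_embeddings ~ base_embeddings @ T.T, then reuse T on new inputs.
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 64                                 # examples, embedding dimension
base = rng.standard_normal((n, d))             # stand-in for base-model embeddings
true_T = rng.standard_normal((d, d))           # unknown ground-truth map (toy)
finetuned = base @ true_T.T + 0.01 * rng.standard_normal((n, d))

# Least-squares estimate of the task matrix.
T_hat, *_ = np.linalg.lstsq(base, finetuned, rcond=None)
T_hat = T_hat.T

new_base = rng.standard_normal((5, d))
pred = new_base @ T_hat.T                      # map unseen base embeddings forward
print(np.abs(pred - new_base @ true_T.T).max())  # small residual on toy data
```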
arXiv Detail & Related papers (2025-12-16T19:51:29Z) - Sequential-Parallel Duality in Prefix Scannable Models [68.39855814099997]
Recent developments have given rise to various models, such as Gated Linear Attention (GLA) and Mamba. This raises a natural question: can we characterize the full class of neural sequence models that support near-constant-time parallel evaluation and linear-time, constant-space sequential inference?
arXiv Detail & Related papers (2025-06-12T17:32:02Z) - How can representation dimension dominate structurally pruned LLMs? [17.953689537875377]
Pruning assumes that a subnetwork exists in the original deep neural network. It is unclear how model performance varies across different subnetwork extractions.
arXiv Detail & Related papers (2025-03-06T12:28:59Z) - Large Language-Geometry Model: When LLM meets Equivariance [53.8505081745406]
We propose EquiLLM, a novel framework for representing 3D physical systems. We show that EquiLLM delivers significant improvements over previous methods across molecular dynamics simulation, human motion simulation, and antibody design.
arXiv Detail & Related papers (2025-02-16T14:50:49Z) - Demystifying Singular Defects in Large Language Models [61.98878352956125]
In large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored. We provide both theoretical insights and empirical validation across a range of recent models. We showcase two practical applications of these findings: the improvement of quantization schemes and the design of LLM signatures.
arXiv Detail & Related papers (2025-02-10T20:09:16Z) - Predicting the Performance of Black-box LLMs through Self-Queries [60.87193950962585]
As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations. We demonstrate that training a linear model on these low-dimensional representations produces reliable predictors of model performance at the instance level.
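A minimal sketch of this black-box setup, with a hypothetical `query_probs` helper standing in for the actual follow-up-prompting pipeline and synthetic labels standing in for observed correctness: each input is represented by the probabilities assigned to a few follow-up responses, and a linear classifier is fit on those low-dimensional features.

```python
# Hedged sketch: low-dimensional black-box features (response probabilities)
# feeding a linear predictor of instance-level performance. query_probs() and
# the synthetic labels are placeholders, not the paper's pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def query_probs(example, n_followups=8):
    # Placeholder: in practice this would send follow-up prompts to the LLM
    # and record the probability it assigns to each candidate response.
    return rng.random(n_followups)

examples = [f"question {i}" for i in range(200)]
X = np.stack([query_probs(e) for e in examples])        # black-box features
y = (X[:, 0] > 0.5).astype(int)                         # synthetic "correct?" labels

clf = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
print("held-out accuracy:", clf.score(X[150:], y[150:]))
```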
arXiv Detail & Related papers (2025-01-02T22:26:54Z) - Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "line theoremarity" establishing a direct relationship between the layer-wise $ell$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
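As a rough illustration of the rotate-then-quantize idea (a sketch assuming a plain uniform grid rather than the MSE-optimal construction the abstract refers to): an orthonormal Hadamard rotation spreads outlier magnitudes across coordinates before quantization, and the rotation is undone afterwards.

```python
# Hedged sketch: quantize a weight matrix directly vs. in a Hadamard-rotated
# basis. The 4-bit uniform grid is an illustrative stand-in, not HIGGS itself.
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
n = 64
W = rng.standard_normal((n, n))
W[:, :2] *= 20.0                              # a few outlier input features

H = hadamard(n) / np.sqrt(n)                  # orthonormal Hadamard rotation

def quantize(M, bits=4):
    # Simple symmetric uniform grid (not MSE-optimal).
    scale = np.abs(M).max() / (2 ** (bits - 1) - 1)
    return np.round(M / scale) * scale

direct = quantize(W)
rotated = quantize(W @ H) @ H.T               # quantize in rotated basis, rotate back

print("direct  MSE:", np.mean((W - direct) ** 2))
print("rotated MSE:", np.mean((W - rotated) ** 2))
```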
arXiv Detail & Related papers (2024-11-26T15:35:44Z) - Learning Linear Attention in Polynomial Time [127.14106124669645]
We provide the first results on learnability of single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. We show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent to the linear Transformer.
arXiv Detail & Related papers (2024-10-14T02:41:01Z) - Unsupervised Representation Learning from Sparse Transformation Analysis [79.94858534887801]
We propose to learn representations from sequence data by factorizing the transformations of the latent variables into sparse components.
Input data are first encoded as distributions of latent activations and subsequently transformed using a probability flow model.
arXiv Detail & Related papers (2024-10-07T23:53:25Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - Transformer Block Coupling and its Correlation with Generalization in LLMs [3.007031501305338]
We analyze the trajectories of token embeddings as they pass through transformer blocks, linearizing the system along these trajectories through their Jacobian matrices. We uncover the phenomenon of transformer block coupling in a multitude of Large Language Models, characterized by the coupling of their top singular vectors across tokens and depth. We further investigate how these properties emerge during training, observing a progressive development of coupling, increased linearity, and layer-wise exponential growth in token trajectories.
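A toy sketch of this coupling measurement, with simplified residual blocks standing in for full attention-plus-MLP transformer blocks (an illustrative assumption): each block is linearized along the token trajectory via its Jacobian, and the top right singular vectors of consecutive Jacobians are compared.

```python
# Hedged sketch: linearize two toy residual blocks along a token trajectory
# and measure the alignment ("coupling") of their top singular vectors.
import torch

torch.manual_seed(0)
d = 32
W1 = torch.randn(d, d) / d ** 0.5
W2 = torch.randn(d, d) / d ** 0.5

def block(W):
    # Minimal residual "block" standing in for attention + MLP.
    return lambda x: x + torch.tanh(W @ x)

x0 = torch.randn(d)
x1 = block(W1)(x0)                          # token embedding after block 1
J1 = torch.func.jacrev(block(W1))(x0)       # linearization of block 1 at x0
J2 = torch.func.jacrev(block(W2))(x1)       # linearization of block 2 at x1

_, _, Vh1 = torch.linalg.svd(J1)
_, _, Vh2 = torch.linalg.svd(J2)

# Coupling statistic: |cosine| between the top right singular vectors.
print(float(torch.abs(Vh1[0] @ Vh2[0])))
```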
arXiv Detail & Related papers (2024-07-10T16:30:27Z) - Optimal Matrix-Mimetic Tensor Algebras via Variable Projection [0.0]
Matrix mimeticity arises from interpreting tensors as operators that can be multiplied, factorized, and analyzed analogously to matrices.
We learn optimal linear mappings and corresponding tensor representations without relying on prior knowledge of the data.
We provide original theory of uniqueness of the transformation and convergence analysis of our variable-projection-based algorithm.
arXiv Detail & Related papers (2024-06-11T04:52:23Z) - Weight-based Decomposition: A Case for Bilinear MLPs [0.0]
Gated Linear Units (GLUs) have become a common building block in modern foundation models.
Bilinear layers drop the non-linearity in the "gate" but still have comparable performance to other GLUs.
We develop a method to decompose the bilinear tensor into a set of interacting eigenvectors.
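A minimal sketch of this bilinear decomposition under toy dimensions and random weights (illustrative assumptions): with the gate nonlinearity removed, each output coordinate is a quadratic form in the input, so it can be analyzed through the eigenvectors of a symmetric interaction matrix.

```python
# Hedged sketch: a bilinear (gate-free) MLP output coordinate as a quadratic
# form x^T B x, decomposed into "interacting eigenvectors" of symmetrized B.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 6, 12
W1 = rng.standard_normal((d_ff, d_model))
W2 = rng.standard_normal((d_ff, d_model))
W_down = rng.standard_normal((d_model, d_ff))

def bilinear_mlp(x):
    # Elementwise product of two linear projections, with no gate nonlinearity.
    return W_down @ ((W1 @ x) * (W2 @ x))

# Interaction matrix for output coordinate j:
#   out_j = x^T B_j x,  B_j = sum_k W_down[j, k] * outer(W1[k], W2[k]).
j = 0
B = sum(W_down[j, k] * np.outer(W1[k], W2[k]) for k in range(d_ff))
B_sym = 0.5 * (B + B.T)                       # symmetrizing preserves the quadratic form
eigvals, eigvecs = np.linalg.eigh(B_sym)      # the interacting eigenvectors

x = rng.standard_normal(d_model)
print(np.allclose(bilinear_mlp(x)[j], x @ B_sym @ x))  # True
```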
arXiv Detail & Related papers (2024-06-06T10:46:51Z) - Unlocking the Secrets of Linear Complexity Sequence Model from A Unified Perspective [26.479602180023125]
The Linear Complexity Sequence Model (LCSM) unites various sequence modeling techniques with linear complexity.
We segment the modeling processes of these models into three distinct stages: Expand, Oscillation, and Shrink.
We perform experiments to analyze the impact of different stage settings on language modeling and retrieval tasks.
arXiv Detail & Related papers (2024-05-27T17:38:55Z) - On the Origins of Linear Representations in Large Language Models [51.88404605700344]
We introduce a simple latent variable model to formalize the concept dynamics of the next token prediction.
Experiments show that linear representations emerge when learning from data matching the latent variable model.
We additionally confirm some predictions of the theory using the LLaMA-2 large language model.
arXiv Detail & Related papers (2024-03-06T17:17:36Z) - Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via
GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks but using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z) - Eigendecomposition-Free Training of Deep Networks for Linear
Least-Square Problems [107.3868459697569]
We introduce an eigendecomposition-free approach to training a deep network.
We show that our approach is much more robust than explicit differentiation of the eigendecomposition.
Our method has better convergence properties and yields state-of-the-art results.
arXiv Detail & Related papers (2020-04-15T04:29:34Z) - Semi-Supervised Learning with Normalizing Flows [54.376602201489995]
FlowGMM is an end-to-end approach to generative semi-supervised learning with normalizing flows.
We show promising results on a wide range of applications, including AG-News and Yahoo Answers text data.
arXiv Detail & Related papers (2019-12-30T17:36:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.