Related papers: Linear Transformers Implicitly Discover Unified Numerical Algorithms

Linear Transformers Implicitly Discover Unified Numerical Algorithms

URL: http://arxiv.org/abs/2509.19702v1
Date: Wed, 24 Sep 2025 02:19:04 GMT
Title: Linear Transformers Implicitly Discover Unified Numerical Algorithms
Authors: Patrick Lutz, Aditya Gangrade, Hadi Daneshmand, Venkatesh Saligrama,
Abstract summary: We train a linear attention transformer on millions of masked-block matrix completion tasks.<n>The model sees only input-output pairs and a mean-squared loss; it is given no normal equations, no handcrafted iterations, and no hint that the tasks are related.<n>We prove that this rule achieves second-order convergence on full-batch problems, cuts distributed iteration complexity, and remains accurate with rank-limited attention.
Score: 28.247731784318404
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We train a linear attention transformer on millions of masked-block matrix completion tasks: each prompt is masked low-rank matrix whose missing block may be (i) a scalar prediction target or (ii) an unseen kernel slice of Nystr\"om extrapolation. The model sees only input-output pairs and a mean-squared loss; it is given no normal equations, no handcrafted iterations, and no hint that the tasks are related. Surprisingly, after training, algebraic unrolling reveals the same parameter-free update rule across three distinct computational regimes (full visibility, rank-limited updates, and distributed computation). We prove that this rule achieves second-order convergence on full-batch problems, cuts distributed iteration complexity, and remains accurate with rank-limited attention. Thus, a transformer trained solely to patch missing blocks implicitly discovers a unified, resource-adaptive iterative solver spanning prediction, estimation, and Nystr\"om extrapolation, highlighting a powerful capability of in-context learning.

Related papers

Unsupervised Representation Learning from Sparse Transformation Analysis [79.94858534887801]
We propose to learn representations from sequence data by factorizing the transformations of the latent variables into sparse components. Input data are first encoded as distributions of latent activations and subsequently transformed using a probability flow model.
arXiv Detail & Related papers (2024-10-07T23:53:25Z)
In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent. For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
Uncovering mesa-optimization algorithms in Transformers [61.06055590704677]
Some autoregressive models can learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so. We show that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed. Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
arXiv Detail & Related papers (2023-09-11T22:42:50Z)
A Second-Order Majorant Algorithm for Nonnegative Matrix Factorization [2.646309221150203]
We introduce a general second-order optimization framework for NMF under both quadratic and $beta$-divergence loss functions.<n>Second-Order Majorant (SOM) constructs a local quadratic majorization of the loss function by majorizing its Hessian matrix.<n>We show that mSOM consistently outperforms state-of-the-art algorithms across multiple loss functions.
arXiv Detail & Related papers (2023-03-31T12:09:36Z)
Optimal Input Gain: All You Need to Supercharge a Feed-Forward Neural Network [0.6562256987706128]
It is shown that pre-processing inputs using linear transformation are equivalent to multiplying the negative gradient matrix with an autocorrelation matrix per training iteration. It is shown that OIG improved HWO could be a significant building block to more complex deep learning architectures.
arXiv Detail & Related papers (2023-03-30T22:20:16Z)
Majorization-minimization for Sparse Nonnegative Matrix Factorization with the $\eta$-divergence [2.3787352248749376]
It is well known that the norm of the other factor (the dictionary matrix) needs to be controlled in order to avoid an ill-posed formulation. Standard practice consists in constraining the columns of the dictionary to have unit norm, which leads to a nontrivial optimization problem. We derive block-descent majorization-minimization algorithms that result in simple multiplicative updates for either $ell_1$-regularization or the more "aggressive" log-regularization.
arXiv Detail & Related papers (2022-07-13T16:09:29Z)
Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. Existing methods are either theoretically flawed or empirically ineffective for visual recognition. We propose a family of Softmax-Free Transformers (SOFT)
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
Training Recurrent Neural Networks by Sequential Least Squares and the Alternating Direction Method of Multipliers [0.20305676256390928]
We propose the use of convex and twice-differentiable loss and regularization terms for determining optimal hidden network parameters. We combine sequential least squares with alternating direction multipliers. The performance of the algorithm is tested in a nonlinear system identification benchmark.
arXiv Detail & Related papers (2021-12-31T08:43:04Z)
Discovering Non-monotonic Autoregressive Orderings with Variational Inference [67.27561153666211]
We develop an unsupervised parallelizable learner that discovers high-quality generation orders purely from training data. We implement the encoder as a Transformer with non-causal attention that outputs permutations in one forward pass. Empirical results in language modeling tasks demonstrate that our method is context-aware and discovers orderings that are competitive with or even better than fixed orders.
arXiv Detail & Related papers (2021-10-27T16:08:09Z)
Alleviate Exposure Bias in Sequence Prediction \\ with Recurrent Neural Networks [47.52214243454995]
A popular strategy to train recurrent neural networks (RNNs) is to take the ground truth as input at each time step. We propose a fully differentiable training algorithm for RNNs to better capture long-term dependencies.
arXiv Detail & Related papers (2021-03-22T06:15:22Z)
A Scalable, Adaptive and Sound Nonconvex Regularizer for Low-rank Matrix Completion [60.52730146391456]
We propose a new non scalable low-rank regularizer called "nuclear Frobenius norm" regularizer, which is adaptive and sound. It bypasses the computation of singular values and allows fast optimization by algorithms. It obtains state-of-the-art recovery performance while being the fastest in existing matrix learning methods.
arXiv Detail & Related papers (2020-08-14T18:47:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.