Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models
- URL: http://arxiv.org/abs/2510.11789v1
- Date: Mon, 13 Oct 2025 18:00:04 GMT
- Title: Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models
- Authors: Shai Zucker, Xiong Wang, Fei Lu, Inbar Seroussi
- Abstract summary: We study the convergence rate of learning pairwise interactions in single-layer attention-style models. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, with $M$ being the sample size.
- Score: 9.144120605998138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$ with $M$ being the sample size, depending only on the smoothness $\beta$ of the activation, and crucially independent of token count, ambient dimension, or rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and activation are not separately identifiable, and they provide a theoretical understanding of the attention mechanism and its training.
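The abstract does not spell out the exact model or estimator, so the following is only a minimal sketch under assumptions: it instantiates one plausible form of a single-layer attention-style pairwise-interaction layer (tokens mixed through a weight matrix A and an elementwise activation phi) and evaluates the stated minimax rate $M^{-\frac{2\beta}{2\beta+1}}$ for a few sample sizes and smoothness levels. The layer form, function names, and parameters here are illustrative assumptions, not the paper's construction.

```python
# Illustrative sketch (assumed form, not the paper's exact model/estimator):
# a single-layer attention-style pairwise-interaction model, where tokens
# x_1, ..., x_N interact through a weight matrix A and a nonlinear activation
# phi, plus the dimension-free minimax rate M^{-2*beta/(2*beta+1)}.
import numpy as np

def pairwise_interaction_model(X, A, phi):
    """One attention-style layer: output_i = (1/N) * sum_j phi(<A x_i, x_j>) * x_j.

    X   : (N, d) array of N tokens in ambient dimension d (assumed form).
    A   : (d, d) weight matrix governing pairwise interactions.
    phi : scalar nonlinear activation, applied elementwise.
    """
    scores = phi(X @ A @ X.T)          # (N, N) pairwise interaction scores
    return scores @ X / X.shape[0]     # aggregate tokens with those scores

def minimax_rate(M, beta):
    """Rate M^{-2*beta/(2*beta+1)} from the abstract: it depends only on the
    smoothness beta of the activation, not on the token count N, the ambient
    dimension d, or the rank of A."""
    return M ** (-2.0 * beta / (2.0 * beta + 1.0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d = 8, 5                                  # token count, ambient dimension
    X = rng.standard_normal((N, d))
    A = rng.standard_normal((d, d)) / np.sqrt(d)
    out = pairwise_interaction_model(X, A, np.tanh)
    print("output shape:", out.shape)            # (N, d): one output per token

    for M in (10**2, 10**4, 10**6):              # sample sizes
        print(M, [round(minimax_rate(M, b), 4) for b in (1.0, 2.0, 4.0)])
```

As the rate suggests, a smoother activation (larger beta) pushes the exponent toward the parametric rate $M^{-1}$, with no dependence on N, d, or the rank of the weight matrix.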
Related papers
- Multi-agent imitation learning with function approximation: Linear Markov games and beyond [63.14746189846806]
We present the first theoretical analysis of multi-agent imitation learning (MAIL) in linear Markov games. We show that it is possible to replace the state-action level "all-policy deviation concentrability coefficient" with a concentrability coefficient defined at the feature level. We also propose a deep, interactive MAIL algorithm that clearly outperforms BC on games such as Tic-Tac-Toe and Connect4.
arXiv Detail & Related papers (2026-02-26T09:50:15Z) - Explicit Multi-head Attention for Inter-head Interaction in Large Language Models [70.96854312026319]
Multi-head Explicit Attention (MEA) is a simple yet effective attention variant that explicitly models cross-head interaction. MEA shows strong robustness in pretraining, which allows the use of larger learning rates that lead to faster convergence. This enables a practical key-value cache compression strategy that reduces KV-cache memory usage by 50% with negligible performance loss.
arXiv Detail & Related papers (2026-01-27T13:45:03Z) - Mamaba Can Learn Low-Dimensional Targets In-Context via Test-Time Feature Learning [53.983686308399676]
Mamba is a recently proposed linear-time sequence model with strong empirical performance. We study in-context learning of a single-index model $y \approx g_*(\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle)$. We prove that Mamba, pretrained by gradient-based methods, can achieve efficient ICL via test-time feature learning.
arXiv Detail & Related papers (2025-10-14T00:21:20Z) - A Random Matrix Analysis of In-context Memorization for Nonlinear Attention [18.90197287760915]
We show that nonlinear Attention incurs higher memorization error than linear ridge regression on random inputs. Our results reveal how nonlinearity and input structure interact with each other to govern the memorization performance of nonlinear Attention.
arXiv Detail & Related papers (2025-06-23T13:56:43Z) - Robustness of Nonlinear Representation Learning [60.15898117103069]
We study the problem of unsupervised representation learning in slightly misspecified settings. We show that the mixing can be identified up to linear transformations and small errors. Those results are a step towards identifiability results for unsupervised representation learning for real-world data.
arXiv Detail & Related papers (2025-03-19T15:57:03Z) - Low-Rank Matrix Factorizations with Volume-based Constraints and Regularizations [2.6687460222685226]
This thesis focuses on volume-based constraints and regularizations designed to enhance interpretability and uniqueness. Motivated by applications such as blind source separation and missing data imputation, this thesis also proposes efficient algorithms.
arXiv Detail & Related papers (2024-12-09T10:58:23Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - On Characterizing and Mitigating Imbalances in Multi-Instance Partial Label Learning [57.18649648182171]
We address the problem of learning imbalances, which has not been studied so far in the context of MI-PLL. We derive class-specific risk bounds for MI-PLL, while making minimal assumptions. Our theory reveals a unique phenomenon: that $\sigma$ can greatly impact learning imbalances.
arXiv Detail & Related papers (2024-07-13T20:56:34Z) - The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks [0.0]
The Local Interaction Basis (LIB) aims to identify computationally relevant features by removing irrelevant activations and interactions.
We evaluate the effectiveness of LIB on modular addition and CIFAR-10 models.
We conclude that LIB is a promising theory-driven approach for analyzing neural networks, but in its current form is not applicable to large language models.
arXiv Detail & Related papers (2024-05-17T17:27:19Z) - A phase transition between positional and semantic learning in a solvable model of dot-product attention [30.96921029675713]
A solvable model of dot-product attention is studied: a non-linear self-attention layer with trainable, low-rank query and key matrices acting on high-dimensional data.
We show that the model learns either a positional attention mechanism (with tokens attending to each other based on their respective positions) or a semantic attention mechanism (with tokens attending to each other based on their meaning), with a transition from the former to the latter as sample complexity increases.
arXiv Detail & Related papers (2024-02-06T11:13:54Z) - Learning Cross-view Geo-localization Embeddings via Dynamic Weighted Decorrelation Regularization [52.493240055559916]
Cross-view geo-localization aims to spot images of the same location shot from two platforms, e.g., the drone platform and the satellite platform.
Existing methods usually focus on optimizing the distance between one embedding and others in the feature space.
In this paper, we argue that low redundancy is also important, as it motivates the model to mine more diverse patterns.
arXiv Detail & Related papers (2022-11-10T02:13:10Z) - Attention improves concentration when learning node embeddings [1.2233362977312945]
Given nodes labelled with search query text, we want to predict links to related queries that share products.
Experiments with a range of deep neural architectures show that simple feedforward networks with an attention mechanism perform best for learning embeddings.
We propose an analytically tractable model of query generation, AttEST, that views both products and the query text as vectors embedded in a latent space.
arXiv Detail & Related papers (2020-06-11T21:21:12Z)