High-Dimensional Analysis of Single-Layer Attention for Sparse-Token Classification
- URL: http://arxiv.org/abs/2509.25153v1
- Date: Mon, 29 Sep 2025 17:54:53 GMT
- Title: High-Dimensional Analysis of Single-Layer Attention for Sparse-Token Classification
- Authors: Nicholas Barnfield, Hugo Cui, Yue M. Lu
- Abstract summary: We show that a simple single-layer attention classifier can in principle achieve vanishing test error when the signal strength grows only logarithmically in the sequence length $L$. We prove that just two gradient updates suffice for the query weight vector of the attention classifier to acquire a nontrivial alignment with the hidden signal, inducing an attention map that selectively amplifies informative tokens.
- Score: 14.110007887109782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When and how can an attention mechanism learn to selectively attend to informative tokens, thereby enabling detection of weak, rare, and sparsely located features? We address these questions theoretically in a sparse-token classification model in which positive samples embed a weak signal vector in a randomly chosen subset of tokens, whereas negative samples are pure noise. In the long-sequence limit, we show that a simple single-layer attention classifier can in principle achieve vanishing test error when the signal strength grows only logarithmically in the sequence length $L$, whereas linear classifiers require $\sqrt{L}$ scaling. Moving from representational power to learnability, we study training at finite $L$ in a high-dimensional regime, where sample size and embedding dimension grow proportionally. We prove that just two gradient updates suffice for the query weight vector of the attention classifier to acquire a nontrivial alignment with the hidden signal, inducing an attention map that selectively amplifies informative tokens. We further derive an exact asymptotic expression for the test error and training loss of the trained attention-based classifier, and quantify its capacity -- the largest dataset size that is typically perfectly separable -- thereby explaining the advantage of adaptive token selection over nonadaptive linear baselines.
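To make the setup concrete, below is a minimal NumPy sketch of the sparse-token data model and the single-layer attention classifier described in the abstract. The specific dimensions, the signal normalization, and the use of a $\sqrt{\log L}$ signal strength are illustrative assumptions consistent with the stated scaling, not the paper's exact parametrization.

```python
import numpy as np

rng = np.random.default_rng(0)

d, L, k = 64, 256, 8          # embedding dim, sequence length, # signal-bearing tokens
lam = np.sqrt(np.log(L))      # signal strength growing only logarithmically in L

u = rng.standard_normal(d)
u /= np.linalg.norm(u)        # hidden unit-norm signal direction

def sample(label):
    """Positive samples plant lam*u in k randomly chosen tokens; negatives are pure noise."""
    X = rng.standard_normal((L, d))
    if label == 1:
        idx = rng.choice(L, size=k, replace=False)
        X[idx] += lam * u
    return X

def attention_score(X, q, w):
    """Single-layer attention pooling followed by a linear readout."""
    s = X @ q                  # per-token scores from the query weight vector q
    a = np.exp(s - s.max())
    a /= a.sum()               # softmax attention map over the L tokens
    return (a @ X) @ w         # attention-weighted token average, then linear readout

# The paper's two-gradient-update result says q acquires a component along u;
# here that alignment is simply hard-coded to show its effect on the scores.
q, w = u.copy(), u.copy()
print(attention_score(sample(1), q, w), attention_score(sample(0), q, w))
```

With q aligned to the hidden signal, the softmax selectively amplifies the k informative tokens, which is the mechanism the abstract contrasts with a nonadaptive linear classifier that averages all L tokens uniformly.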
Related papers
- Provably Reliable Classifier Guidance via Cross-Entropy Control [4.298880233819988]
We show that the cross-entropy loss at each diffusion model step is sufficient to control the corresponding guidance error. Our result yields an upper bound on the sampling error of classifier-guided diffusion models and bears resemblance to a reverse log-Sobolev-type inequality.
arXiv Detail & Related papers (2026-01-29T02:59:04Z) - CRITS: Convolutional Rectifier for Interpretable Time Series Classification [41.18535141696404]
We propose Convolutional Rectifier for Interpretable Time Series Classification, or CRITS, as an interpretable model for time series classification. We evaluate CRITS on a set of datasets, and study its classification performance and its explanation alignment, sensitivity and understandability.
arXiv Detail & Related papers (2025-05-24T08:34:08Z) - Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations [54.17275171325324]
We present a counterexample to the Linear Representation Hypothesis (LRH).
When trained to repeat an input token sequence, neural networks learn to represent the token at each position with a particular order of magnitude, rather than a direction.
These findings strongly indicate that interpretability research should not be confined to the LRH.
arXiv Detail & Related papers (2024-08-20T15:04:37Z) - Exploring Beyond Logits: Hierarchical Dynamic Labeling Based on Embeddings for Semi-Supervised Classification [49.09505771145326]
We propose a Hierarchical Dynamic Labeling (HDL) algorithm that does not depend on model predictions and utilizes image embeddings to generate sample labels.
Our approach has the potential to change the paradigm of pseudo-label generation in semi-supervised learning.
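As an illustration of embedding-based label generation, here is a generic nearest-prototype pseudo-labeling sketch; it is not the paper's hierarchical HDL algorithm, only a minimal example of assigning labels from embeddings rather than model predictions.

```python
import numpy as np

def prototype_pseudo_labels(emb_labeled, y_labeled, emb_unlabeled):
    """Assign pseudo-labels by cosine similarity to per-class mean embeddings."""
    classes = np.unique(y_labeled)
    normed = emb_labeled / np.linalg.norm(emb_labeled, axis=1, keepdims=True)
    protos = np.stack([normed[y_labeled == c].mean(axis=0) for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    u = emb_unlabeled / np.linalg.norm(emb_unlabeled, axis=1, keepdims=True)
    sims = u @ protos.T                       # cosine similarity to each prototype
    return classes[np.argmax(sims, axis=1)]   # label of the nearest prototype
```

Note this depends only on embeddings, never on model logits, which is the property the summary above highlights.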
arXiv Detail & Related papers (2024-04-26T06:00:27Z) - Semi-Supervised Laplace Learning on Stiefel Manifolds [48.3427853588646]
We develop a Sequential Subspace framework for graph-based semi-supervised learning at low label rates.
Our methods perform well at extremely low label rates as well as at high label rates.
arXiv Detail & Related papers (2023-07-31T20:19:36Z) - Do We Really Need a Learnable Classifier at the End of Deep Neural
Network? [118.18554882199676]
We study the potential of training a neural network for classification with the classifier fixed throughout training as a randomly initialized equiangular tight frame (ETF).
Our experimental results show that our method achieves comparable performance on image classification for balanced datasets.
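For context, below is a minimal sketch of one standard simplex equiangular tight frame (ETF) construction from the neural-collapse literature; the paper's exact recipe may differ.

```python
import numpy as np

def simplex_etf(num_classes: int, dim: int, seed: int = 0) -> np.ndarray:
    """Columns are unit-norm class prototypes with pairwise cosine -1/(C-1)."""
    assert dim >= num_classes
    rng = np.random.default_rng(seed)
    # Random orthonormal basis U in R^{dim x C} via reduced QR.
    U, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))
    C = num_classes
    return np.sqrt(C / (C - 1)) * U @ (np.eye(C) - np.ones((C, C)) / C)

W = simplex_etf(num_classes=10, dim=512)  # frozen classifier: logits = features @ W
```

Only the feature extractor is trained; the classifier W stays fixed, which is the setting the experiments above evaluate.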
arXiv Detail & Related papers (2022-03-17T04:34:28Z) - Soft-margin classification of object manifolds [0.0]
A neural population responding to multiple appearances of a single object defines a manifold in the neural response space.
The ability to classify such manifolds is of interest, as object recognition and other computational tasks require a response that is insensitive to variability within a manifold.
Soft-margin classifiers are a broader class of algorithms that provide an additional regularization parameter, used in applications to optimize performance outside the training set.
arXiv Detail & Related papers (2022-03-14T12:23:36Z) - Learning Debiased and Disentangled Representations for Semantic
Segmentation [52.35766945827972]
We propose a model-agnostic training scheme for semantic segmentation.
By randomly eliminating certain class information in each training iteration, we effectively reduce feature dependencies among classes.
Models trained with our approach demonstrate strong results on multiple semantic segmentation benchmarks.
arXiv Detail & Related papers (2021-10-31T16:15:09Z) - Prototypical Classifier for Robust Class-Imbalanced Learning [64.96088324684683]
We propose Prototypical, which does not require fitting additional parameters given the embedding network.
Prototypical produces balanced and comparable predictions for all classes even though the training set is class-imbalanced.
We test our method on CIFAR-10LT, CIFAR-100LT and Webvision datasets, observing that Prototypical obtains substantial improvements compared with state-of-the-art methods.
arXiv Detail & Related papers (2021-10-22T01:55:01Z) - Cherry-Picking Gradients: Learning Low-Rank Embeddings of Visual Data
via Differentiable Cross-Approximation [53.95297550117153]
We propose an end-to-end trainable framework that processes large-scale visual data tensors by looking at a fraction of their entries only.
The proposed approach is particularly useful for large-scale multidimensional grid data, and for tasks that require context over a large receptive field.
arXiv Detail & Related papers (2021-05-29T08:39:57Z) - On Supervised Classification of Feature Vectors with Independent and
Non-Identically Distributed Elements [10.52087851034255]
We investigate the problem of classifying feature vectors with mutually independent but non-identically distributed elements.
We show that the error probability goes to zero as the length of the feature vectors grows, even when there is only one training feature vector per label available.
arXiv Detail & Related papers (2020-08-01T06:49:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.