Transformers Learn Faster with Semantic Focus
- URL: http://arxiv.org/abs/2506.14095v2
- Date: Wed, 18 Jun 2025 12:59:54 GMT
- Title: Transformers Learn Faster with Semantic Focus
- Authors: Parikshit Ram, Kenneth L. Clarkson, Tim Klinger, Shashanka Ubaru, Alexander G. Gray
- Abstract summary: We study sparse transformers in terms of learnability and generalization. We find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models.
- Score: 57.97235825738412
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Various forms of sparse attention have been explored to mitigate the quadratic computational and memory cost of the attention mechanism in transformers. We study sparse transformers not through a lens of efficiency but rather in terms of learnability and generalization. Empirically studying a range of attention mechanisms, we find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models, while input-agnostic sparse attention models show no such benefits -- a phenomenon that is robust across architectural and optimization hyperparameter choices. This can be interpreted as demonstrating that concentrating a model's "semantic focus" with respect to the tokens currently being considered (in the form of input-dependent sparse attention) accelerates learning. We develop a theoretical characterization of the conditions that explain this behavior. We establish a connection between the stability of the standard softmax and the loss function's Lipschitz properties, then show how sparsity affects the stability of the softmax and the subsequent convergence and generalization guarantees resulting from the attention mechanism. This allows us to theoretically establish that input-agnostic sparse attention does not provide any benefits. We also characterize conditions when semantic focus (input-dependent sparse attention) can provide improved guarantees, and we validate that these conditions are in fact met in our empirical evaluations.
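To make the core distinction concrete, below is a minimal NumPy sketch (not the paper's code) contrasting dense softmax attention, an input-dependent sparse variant (each query keeps only its top-k scoring keys, so the surviving positions depend on the tokens), and an input-agnostic variant (a fixed local window mask that ignores the tokens); the dimensions, k, and window width are illustrative choices.
```python
# Minimal sketch, not the paper's implementation: dense softmax attention vs.
# input-dependent sparsity (top-k over the scores) vs. input-agnostic sparsity
# (a fixed positional window). Sizes, k, and the window width are arbitrary.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(x)                            # exp(-inf) = 0, so masked entries get zero weight
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def topk_attention(Q, K, V, k=2):
    # Input-dependent ("semantic focus"): the kept keys change with the content of Q and K.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    kth = np.sort(scores, axis=-1)[:, -k][:, None]   # k-th largest score per query
    return softmax(np.where(scores >= kth, scores, -np.inf)) @ V

def windowed_attention(Q, K, V, w=1):
    # Input-agnostic: the mask depends only on token positions, never on the tokens.
    n = Q.shape[0]
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= w
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(np.where(mask, scores, -np.inf)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 8)) for _ in range(3))
print(dense_attention(Q, K, V).shape,
      topk_attention(Q, K, V).shape,
      windowed_attention(Q, K, V).shape)
```
In the top-k variant the retained keys move with the input, which is the "semantic focus" the paper associates with faster convergence, whereas the windowed mask stays fixed regardless of the tokens.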
Related papers
- Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective [69.72942835553228]
This paper theoretically demonstrates that sigmoid self-attention is more sample-efficient than its softmax counterpart. We represent the self-attention matrix as a mixture of experts and show that "experts" in sigmoid self-attention require significantly less data to achieve the same approximation error as those in softmax self-attention.
arXiv Detail & Related papers (2025-02-01T02:36:14Z)
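As a rough, self-contained illustration of the softmax-versus-sigmoid contrast in the entry above (not the paper's mixture-of-experts analysis): softmax couples the scores in each row through normalization, while sigmoid squashes each score independently.
```python
# Toy contrast only; the mixture-of-experts argument of the paper is not reproduced here.
import numpy as np

def softmax_weights(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)        # each row of weights sums to 1

def sigmoid_weights(scores, bias=0.0):
    return 1.0 / (1.0 + np.exp(-(scores + bias)))   # elementwise; rows need not sum to 1

scores = np.random.default_rng(1).standard_normal((4, 4))
print(softmax_weights(scores).sum(axis=-1))  # [1. 1. 1. 1.]
print(sigmoid_weights(scores).sum(axis=-1))  # unnormalized row sums
```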
- Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models [7.80071686970278]
Traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases. This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the $l_1$-norm. We create a novel attention mechanism that performs better than conventional Softmax attention across various inference lengths.
arXiv Detail & Related papers (2025-01-23T07:21:08Z)
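The decomposition mentioned in the entry above can be sketched as follows, assuming the reading that softmax = elementwise exponential followed by $l_1$-normalization; the paper's re-weighting step is not reproduced here.
```python
# Sketch of the softmax decomposition: a pointwise non-linearity followed by l1-normalization.
# Swapping exp for softplus gives an alternative, still-positive weighting.
import numpy as np

def normalize_l1(x, axis=-1):
    return x / x.sum(axis=axis, keepdims=True)      # entries are non-negative here

def softmax_weights(scores):
    return normalize_l1(np.exp(scores - scores.max(axis=-1, keepdims=True)))

def softplus_weights(scores):
    return normalize_l1(np.logaddexp(0.0, scores))  # softplus(x) = log(1 + e^x), computed stably

scores = np.random.default_rng(2).standard_normal((1, 4))
print(softmax_weights(scores).round(3))   # exp sharpens the weights
print(softplus_weights(scores).round(3))  # softplus grows only linearly, giving flatter weights
```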
- Optical aberrations in autonomous driving: Physics-informed parameterized temperature scaling for neural network uncertainty calibration [49.03824084306578]
We propose to incorporate a physical inductive bias into the neural network calibration architecture to enhance the robustness and the trustworthiness of the AI target application. We pave the way for a trustworthy uncertainty representation and for a holistic verification strategy of the perception chain.
arXiv Detail & Related papers (2024-12-18T10:36:46Z)
- Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density [93.32594873253534]
Trustworthy machine learning requires meticulous regulation of model reliance on non-robust features.
We propose a framework to delineate and regulate such features by attributing model predictions to the input.
arXiv Detail & Related papers (2024-07-05T09:16:56Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism performs robustly and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- A phase transition between positional and semantic learning in a solvable model of dot-product attention [30.96921029675713]
A solvable model of dot-product attention is studied as a non-linear self-attention layer with trainable and low-rank query and key matrices.
We show that this model learns either a positional attention mechanism (with tokens attending to each other based on their respective positions) or a semantic attention mechanism (with tokens attending to each other based on their meaning), and exhibits a transition from the former to the latter with increasing sample complexity.
arXiv Detail & Related papers (2024-02-06T11:13:54Z)
- How Smooth Is Attention? [26.322030088685928]
We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios.
We show that for inputs of length $n$ in any compact set, the Lipschitz constant of self-attention is bounded by $\sqrt{n}$ up to a constant factor.
Our mean-field framework for masked self-attention is novel and of independent interest.
arXiv Detail & Related papers (2023-12-22T16:47:10Z)
- Monotonicity and Double Descent in Uncertainty Estimation with Gaussian Processes [52.92110730286403]
It is commonly believed that the marginal likelihood should be reminiscent of cross-validation metrics and that both should deteriorate with larger input dimensions.
We prove that by tuning hyperparameters, the performance, as measured by the marginal likelihood, improves monotonically with the input dimension.
We also prove that cross-validation metrics exhibit qualitatively different behavior that is characteristic of double descent.
arXiv Detail & Related papers (2022-10-14T08:09:33Z)
- Robustness and Accuracy Could Be Reconcilable by (Proper) Definition [109.62614226793833]
The trade-off between robustness and accuracy has been widely studied in the adversarial literature.
We find that it may stem from the improperly defined robust error, which imposes an inductive bias of local invariance.
The resulting definition, SCORE (self-consistent robust error), facilitates the reconciliation between robustness and accuracy, while still handling the worst-case uncertainty.
arXiv Detail & Related papers (2022-02-21T10:36:09Z)
- Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than those based on competing architectures for a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
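For intuition only: one generic way to realize double normalization is Sinkhorn-style alternating column/row normalization of the exponentiated score matrix, so that no key's total attention mass collapses toward zero. The sketch below illustrates that generic idea under this assumption and is not necessarily the exact scheme proposed in the paper above.
```python
# Generic Sinkhorn-style double normalization; an illustration of the idea, not the paper's scheme.
import numpy as np

def doubly_normalized_weights(scores, iters=5):
    w = np.exp(scores - scores.max())
    for _ in range(iters):
        w = w / w.sum(axis=0, keepdims=True)   # balance each key's total weight across queries (columns)
        w = w / w.sum(axis=-1, keepdims=True)  # each query's weights sum to 1, as in standard softmax (rows)
    return w

w = doubly_normalized_weights(np.random.default_rng(3).standard_normal((5, 5)))
print(w.sum(axis=-1))          # rows sum to 1
print(w.sum(axis=0).round(2))  # column sums are roughly equal: no key is fully "explained away"
```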
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.