Localmax dynamics for attention in transformers and its asymptotic behavior
- URL: http://arxiv.org/abs/2509.15958v1
- Date: Fri, 19 Sep 2025 13:18:30 GMT
- Title: Localmax dynamics for attention in transformers and its asymptotic behavior
- Authors: Henri Cimetière, Maria Teresa Chiri, Bahman Gharesifard
- Abstract summary: We introduce a new discrete-time attention model, the localmax dynamics, where only the tokens that maximize the influence toward a given token have a positive weight. We show that localmax dynamics does not exhibit finite-time convergence and provide results for vanishing, nonzero, time-varying alignment-sensitivity parameters. We also adapt Lyapunov-based methods from classical opinion dynamics, highlighting their limitations in the asymmetric setting of localmax interactions.
- Score: 1.376408511310322
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce a new discrete-time attention model, termed the localmax dynamics, which interpolates between the classic softmax dynamics and the hardmax dynamics, where only the tokens that maximize the influence toward a given token have a positive weight. As in hardmax, uniform weights are determined by a parameter controlling neighbor influence, but the key extension lies in relaxing neighborhood interactions through an alignment-sensitivity parameter, which allows controlled deviations from pure hardmax behavior. As we prove, while the convex hull of the token states still converges to a convex polytope, its structure can no longer be fully described by a maximal alignment set, prompting the introduction of quiescent sets to capture the invariant behavior of tokens near vertices. We show that these sets play a key role in understanding the asymptotic behavior of the system, even under time-varying alignment sensitivity parameters. We further show that localmax dynamics does not exhibit finite-time convergence and provide results for vanishing, nonzero, time-varying alignment-sensitivity parameters, recovering the limiting behavior of hardmax as a by-product. Finally, we adapt Lyapunov-based methods from classical opinion dynamics, highlighting their limitations in the asymmetric setting of localmax interactions and outlining directions for future research.
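The abstract defines the localmax update only in words. The sketch below is a minimal numerical reading of it, assuming inner-product influences, an eps-relaxed argmax neighborhood with uniform weights, and a hardmax-style convex-combination update; the names `alpha` (neighbor influence) and `eps` (alignment sensitivity) are our labels, not the paper's notation.

```python
import numpy as np

def localmax_step(X, alpha=0.5, eps=0.1):
    """One step of a localmax-style update (illustrative sketch).

    X     : (n, d) array of token states.
    alpha : neighbor-influence parameter (assumed role).
    eps   : alignment-sensitivity parameter; eps = 0 recovers a
            hardmax-type selection, larger eps admits near-maximal tokens.
    """
    C = X @ X.T                       # pairwise influences <x_j, x_i>
    X_new = np.empty_like(X)
    for i in range(len(X)):
        c = C[i]
        # only tokens whose influence is within eps of the maximum
        # receive positive (uniform) weight; all others get zero
        S = np.flatnonzero(c >= c.max() - eps)
        mean_neighbor = X[S].mean(axis=0)
        # convex combination keeps each state in the convex hull of tokens,
        # consistent with the convex-polytope limit described above
        X_new[i] = (X[i] + alpha * mean_neighbor) / (1.0 + alpha)
    return X_new
```

Setting `eps = 0` collapses the neighborhood to the exact argmax set, which is how the hardmax limiting behavior is recovered as a by-product.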
Related papers
- Towards Arbitrary Motion Completing via Hierarchical Continuous Representation [64.6525112550758]
We propose a novel parametric activation-induced hierarchical implicit representation framework, called NAME, built on Implicit Neural Representations (INRs). Our method introduces a hierarchical temporal encoding mechanism that extracts features from motion sequences at multiple temporal scales, enabling effective capture of intricate temporal patterns.
arXiv Detail & Related papers (2025-12-24T14:07:04Z) - Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective [16.076157672455867]
We develop a measure-based framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we lean on the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure.
arXiv Detail & Related papers (2025-12-12T18:54:52Z) - Statistical Advantage of Softmax Attention: Insights from Single-Location Regression [0.0]
We study the dominance of softmax attention over linear alternatives in large language models. We show that softmax achieves the Bayes risk, whereas linear attention fundamentally falls short. We discuss the connection with optimization by gradient-based algorithms.
arXiv Detail & Related papers (2025-09-26T06:21:30Z) - Manifold Trajectories in Next-Token Prediction: From Replicator Dynamics to Softmax Equilibrium [0.0]
Decoding in large language models is often described as scoring tokens and normalizing with softmax. We give a self-contained derivation of this step as a constrained variational principle on the probability simplex. We prove that, for a fixed context and temperature, the next-token distribution follows a smooth trajectory inside the simplex and converges to the softmax equilibrium.
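As a toy numerical check of that convergence claim (not the paper's construction), the entropy-regularized replicator/mirror-descent iteration below evolves a distribution on the simplex and has the softmax distribution as its unique fixed point; `eta` and `T` are illustrative step-size and temperature parameters.

```python
import numpy as np

def softmax(s, T=1.0):
    e = np.exp((s - s.max()) / T)     # numerically stable softmax
    return e / e.sum()

def entropic_trajectory(s, T=1.0, eta=0.1, steps=500):
    """Replicator-style flow for the entropy-regularized linear objective
    f(p) = -<s, p> + T * sum_i p_i log p_i; its fixed point satisfies
    s - T log p = const, i.e. p = softmax(s / T)."""
    p = np.full(len(s), 1.0 / len(s))          # start at the barycenter
    for _ in range(steps):
        p = p * np.exp(eta * (s - T * np.log(p)))
        p = p / p.sum()                         # renormalize onto the simplex
    return p

scores = np.array([2.0, 1.0, 0.0])
print(np.allclose(entropic_trajectory(scores), softmax(scores), atol=1e-6))
```

In log-coordinates the update is a contraction with rate (1 - eta), so the trajectory stays inside the simplex and converges geometrically to the softmax equilibrium.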
arXiv Detail & Related papers (2025-08-28T20:00:22Z) - Long-Context Generalization with Sparse Attention [21.312711979288004]
Transformer-based architectures traditionally employ softmax to compute attention weights. As sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that sparse attention mechanisms using $\alpha$-entmax can avoid these issues.
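The summary invokes $\alpha$-entmax generally; as a concrete special case, the $\alpha = 2$ member (sparsemax, Martins & Astudillo, 2016) has a simple closed form: a Euclidean projection onto the simplex that can assign exactly zero weight to low-scoring tokens, which is the sparsity property relied on above. A minimal sketch:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax (the alpha = 2 case of alpha-entmax): Euclidean
    projection of scores z onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]               # scores in decreasing order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    # support size: largest k with 1 + k * z_(k) > sum_{j<=k} z_(j)
    k_max = k[1 + k * z_sorted > cumsum][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max      # threshold
    return np.maximum(z - tau, 0.0)            # zeros out weak tokens

print(sparsemax(np.array([3.0, 1.0, 0.5])))   # -> [1. 0. 0.]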
arXiv Detail & Related papers (2025-06-19T22:43:25Z) - Self-Adjust Softmax [62.267367768385434]
The softmax function is crucial in Transformer attention, normalizing each row of the attention scores to sum to one. We propose Self-Adjust Softmax (SA-Softmax), which modifies $\mathrm{softmax}(x)$ to $x \cdot \mathrm{softmax}(x)$, with the normalized variant $\frac{x - \min(x_{\min}, 0)}{\max(0, x_{\max}) - \min(x_{\min}, 0)} \cdot \mathrm{softmax}(x)$.
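A direct transcription of the two formulas from the abstract into code, to make the construction concrete; this is a sketch of the stated formulas only, not the paper's full attention layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def sa_softmax(x, axis=-1):
    """SA-Softmax: scale each probability by its own raw score."""
    return x * softmax(x, axis=axis)

def sa_softmax_normalized(x, axis=-1):
    """Normalized variant, per the formula in the abstract:
    (x - min(x_min, 0)) / (max(0, x_max) - min(x_min, 0)) * softmax(x)."""
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    scale = np.maximum(0.0, x_max) - np.minimum(x_min, 0.0)
    return (x - np.minimum(x_min, 0.0)) / scale * softmax(x, axis=axis)
```

The normalized variant rescales the score factor into a nonnegative range so the modified weights cannot flip sign.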
arXiv Detail & Related papers (2025-02-25T15:07:40Z) - Towards Spectral Convergence of Locally Linear Embedding on Manifolds with Boundary [0.0]
We study the eigenvalues and eigenfunctions of a differential operator that governs the behavior of the unsupervised learning algorithm known as Locally Linear Embedding. We show that a natural regularity condition on the eigenfunctions imposes a consistent boundary condition and use the Frobenius method to estimate pointwise behavior.
arXiv Detail & Related papers (2025-01-16T14:45:53Z) - Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality [54.20763128054692]
We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression.
We prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics.
arXiv Detail & Related papers (2024-02-29T18:43:52Z) - Convex Bounds on the Softmax Function with Applications to Robustness Verification [69.09991317119679]
The softmax function is a ubiquitous component at the output of neural networks and increasingly in intermediate layers as well.
This paper provides convex lower bounds and concave upper bounds on the softmax function, which are compatible with convex optimization formulations for characterizing neural networks and other ML models.
arXiv Detail & Related papers (2023-03-03T05:07:02Z) - Nesterov Meets Optimism: Rate-Optimal Separable Minimax Optimization [108.35402316802765]
We propose a new first-order optimization algorithm, Accelerated Gradient-Optimistic Gradient (AG-OG) Descent Ascent.
We show that AG-OG achieves the optimal convergence rate (up to a constant) for a variety of settings.
We further extend our algorithm to the stochastic setting and achieve the optimal convergence rate in both bi-SC-SC and bi-C-SC settings.
arXiv Detail & Related papers (2022-10-31T17:59:29Z) - Stabilizing Q Learning Via Soft Mellowmax Operator [12.208344427928466]
Mellowmax is a recently proposed differentiable and non-expansive softmax operator that allows convergent behavior in learning and planning.
We show that our SM2 operator can be applied to challenging multi-agent reinforcement learning scenarios, leading to stable value function approximation and state-of-the-art performance.
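The SM2 (soft mellowmax) operator builds on the standard mellowmax of Asadi & Littman (2017); the sketch below implements only that base operator, since the abstract does not reproduce SM2's exact form. The non-expansion property is what underwrites convergent value-function backups.

```python
import numpy as np

def mellowmax(x, omega=5.0):
    """Mellowmax (Asadi & Littman, 2017): a differentiable non-expansion
    that can replace max in value-function backups:
        mm_omega(x) = log( (1/n) * sum_i exp(omega * x_i) ) / omega.
    It interpolates between the mean (omega -> 0) and max (omega -> inf)."""
    m = omega * np.asarray(x, dtype=float)
    # numerically stable log-mean-exp
    return (m.max() + np.log(np.mean(np.exp(m - m.max())))) / omega

q_values = [1.0, 2.0, 1.5]
print(mellowmax(q_values))   # between mean(q) = 1.5 and max(q) = 2.0
```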
arXiv Detail & Related papers (2020-12-17T09:11:13Z) - Efficient Methods for Structured Nonconvex-Nonconcave Min-Max Optimization [98.0595480384208]
We propose a generalization of the extragradient method which converges to a stationary point.
The algorithm applies not only to Euclidean spaces but also to general $\ell_p$-normed finite-dimensional vector spaces.
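The base scheme being generalized, Korpelevich's extragradient method, is standard; below is a minimal sketch on a bilinear saddle point, where plain gradient descent-ascent spirals outward but extragradient converges. This illustrates only the classical method, not the paper's generalization.

```python
import numpy as np

def extragradient(F, z0, eta=0.1, steps=2000):
    """Classical extragradient: a look-ahead step, then an update
    evaluated at the look-ahead point.

    F : operator of the min-max problem min_x max_y f(x, y),
        i.e. F(x, y) = (grad_x f, -grad_y f).
    """
    z = np.asarray(z0, dtype=float)
    for _ in range(steps):
        z_half = z - eta * F(z)       # extrapolation (look-ahead)
        z = z - eta * F(z_half)       # update using look-ahead gradient
    return z

# bilinear saddle point min_x max_y x*y: F(x, y) = (y, -x)
F = lambda z: np.array([z[1], -z[0]])
print(extragradient(F, [1.0, 1.0]))   # converges toward the (0, 0) saddle
```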
arXiv Detail & Related papers (2020-10-31T21:35:42Z) - Optimal Approximation -- Smoothness Tradeoffs for Soft-Max Functions [73.33961743410876]
A soft-max function has two main efficiency measures: approximation and smoothness.
We identify the optimal approximation-smoothness tradeoffs for different measures of approximation and smoothness.
This leads to novel soft-max functions, each of which is optimal for a different application.
arXiv Detail & Related papers (2020-10-22T05:19:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.