Sliced ReLU attention: Quasi-linear contextual expressivity via sorting
- URL: http://arxiv.org/abs/2512.11411v1
- Date: Fri, 12 Dec 2025 09:39:14 GMT
- Title: Sliced ReLU attention: Quasi-linear contextual expressivity via sorting
- Authors: Siwan Boufadène, François-Xavier Vialard
- Abstract summary: We introduce sliced ReLU attention, a new attention mechanism that departs structurally from both softmax and ReLU-based alternatives. This construction yields a differentiable, non-symmetric kernel that can be computed in O(n log(n)) through a sorting procedure. We show that sliced ReLU attention preserves the ability to perform nontrivial sequence-to-sequence disentangling tasks.
- Score: 6.429235334672735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce sliced ReLU attention, a new attention mechanism that departs structurally from both softmax and ReLU-based alternatives. Instead of applying a nonlinearity to pairwise dot products, we operate on one-dimensional projections of key--query differences and leverage sorting to obtain quasi-linear complexity. This construction yields a differentiable, non-symmetric kernel that can be computed in O(n log(n)) through a sorting procedure, making it suitable for very long contexts. Beyond computational benefits, the model retains strong theoretical expressive power: we establish two in-context expressivity results, previously known for softmax attention, showing that sliced ReLU attention preserves the ability to perform nontrivial sequence-to-sequence disentangling tasks and satisfies a contextual universal approximation property. Finally, we illustrate the potential practical interest of this kernel in small-scale experiments.
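The abstract does not spell out the kernel, so the following is a minimal NumPy sketch under stated assumptions: we take the slice score to be ReLU(⟨θ, q_i⟩ − ⟨θ, k_j⟩) for a projection direction θ, leave the output unnormalized, and average over directions. The function names (`sliced_relu_slice`, `sliced_relu_attention`) and the uniform averaging are illustrative, not the paper's exact definition. What the sketch does demonstrate is the sorting trick behind the quasi-linear complexity: once keys are sorted along a slice, each query's sum splits into two prefix sums, giving O(n log(n)) instead of the O(n^2) pairwise evaluation.

```python
import numpy as np

def sliced_relu_slice(q_proj, k_proj, values):
    """One slice: out_i = sum_j ReLU(q_proj[i] - k_proj[j]) * values[j],
    computed in O(n log n) via sorting + prefix sums instead of O(n^2).
    Only keys with k_proj[j] <= q_proj[i] contribute (the ReLU zeroes
    the rest), and on that prefix the ReLU is linear, so
    out_i = q_proj[i] * sum_j values[j] - sum_j k_proj[j] * values[j]."""
    order = np.argsort(k_proj)                    # O(n log n) sort of keys
    k_sorted = k_proj[order]
    v_sorted = values[order]
    zeros = np.zeros((1,) + values.shape[1:])
    cum_v = np.concatenate([zeros, np.cumsum(v_sorted, axis=0)])
    cum_kv = np.concatenate([zeros, np.cumsum(k_sorted[:, None] * v_sorted, axis=0)])
    # per-query prefix boundary, found by binary search: O(n log n)
    idx = np.searchsorted(k_sorted, q_proj, side="right")
    return q_proj[:, None] * cum_v[idx] - cum_kv[idx]

def sliced_relu_attention(Q, K, V, directions):
    """Average the one-dimensional slice computation over projection
    directions (the 'slices'); unnormalized, for illustration only."""
    out = np.zeros_like(V, dtype=float)
    for theta in directions:
        out += sliced_relu_slice(Q @ theta, K @ theta, V)
    return out / len(directions)

# Sanity check of one slice against the explicit O(n^2) kernel.
rng = np.random.default_rng(0)
n, d = 16, 4
q, k, v = rng.normal(size=n), rng.normal(size=n), rng.normal(size=(n, d))
ref = np.maximum(q[:, None] - k[None, :], 0.0) @ v
assert np.allclose(sliced_relu_slice(q, k, v), ref)
```

The sanity check compares the prefix-sum path against the explicit quadratic kernel; the same decomposition is what makes the cost per slice O(n log(n)) and hence suitable for very long contexts, as the abstract claims.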
Related papers
- Efficient Linear Attention for Multivariate Time Series Modeling via Entropy Equality [30.606567864965243]
We propose a novel linear attention mechanism designed to overcome the limitations of existing attention mechanisms. We develop an efficient algorithm that computes the entropy of dot-product-derived distributions with only linear complexity.
arXiv Detail & Related papers (2025-11-05T05:07:55Z) - Stochastic Momentum Methods for Non-smooth Non-Convex Finite-Sum Coupled Compositional Optimization [68.22688819802622]
We propose stochastic momentum methods and establish a new state-of-the-art complexity of $O(1/\epsilon)$ for finding a (nearly) $\epsilon$-level KKT solution.
arXiv Detail & Related papers (2025-06-03T06:31:59Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Linearization Algorithms for Fully Composite Optimization [61.20539085730636]
This paper studies first-order algorithms for solving fully composite optimization problems over convex compact sets.
We leverage the structure of the objective by handling its differentiable and non-differentiable components separately, linearizing only the smooth parts.
arXiv Detail & Related papers (2023-02-24T18:41:48Z) - Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency [111.83670279016599]
We study reinforcement learning for partially observable Markov decision processes (POMDPs) with infinite observation and state spaces.
We make the first attempt at partial observability and function approximation for a class of POMDPs with a linear structure.
arXiv Detail & Related papers (2022-04-20T21:15:38Z) - Convex Analysis of the Mean Field Langevin Dynamics [49.66486092259375]
A convergence rate analysis of the mean field Langevin dynamics is presented.
The proximal Gibbs distribution $p_q$ associated with the dynamics allows us to develop a convergence theory parallel to classical results in convex optimization.
arXiv Detail & Related papers (2022-01-25T17:13:56Z) - Efficient Methods for Structured Nonconvex-Nonconcave Min-Max Optimization [98.0595480384208]
We propose a generalization of the extragradient method which converges to a stationary point.
The algorithm applies not only to Euclidean spaces, but also to general $p$-normed vector spaces.
arXiv Detail & Related papers (2020-10-31T21:35:42Z) - Implicit Regularization via Neural Feature Alignment [39.257382575749354]
We highlight a regularization effect induced by a dynamical alignment of neural tangent features.
By extrapolating a new analysis of Rademacher complexity bounds for linear models, we motivate and study a complexity measure that captures this phenomenon.
arXiv Detail & Related papers (2020-08-03T15:18:07Z) - Dualize, Split, Randomize: Toward Fast Nonsmooth Optimization Algorithms [21.904012114713428]
We consider the minimization of the sum of three convex functions, where the first one $F$ is smooth, the second one is nonsmooth and proximable, and the third one is the composition of a nonsmooth proximable function with a linear operator $L$.
This template problem has many applications, for instance, in image processing and machine learning.
We propose a new primal-dual algorithm, which we call PDDY, for this problem.
arXiv Detail & Related papers (2020-04-03T10:48:01Z)