Related papers: HLA: Hadamard Linear Attention

HLA: Hadamard Linear Attention

URL: http://arxiv.org/abs/2602.12128v1
Date: Thu, 12 Feb 2026 16:16:47 GMT
Title: HLA: Hadamard Linear Attention
Authors: Hanno Ackermann, Hong Cai, Mohsen Ghafoorian, Amirhossein Habibian,
Abstract summary: The attention mechanism is an important reason for the success of transformers.<n>It relies on computing pairwise relations between tokens.<n> linear attention has been proposed as an efficient approximation.
Score: 23.409131174857666
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The attention mechanism is an important reason for the success of transformers. It relies on computing pairwise relations between tokens. To reduce the high computational cost of standard quadratic attention, linear attention has been proposed as an efficient approximation. It employs kernel functions that are applied independently to the inputs before the pairwise similarities are calculated. That allows for an efficient computational procedure which, however, amounts to a low-degree rational function approximating softmax. We propose Hadamard Linear Attention (HLA). Unlike previous works on linear attention, the nonlinearity in HLA is not applied separately to queries and keys, but, analogously to standard softmax attention, after the pairwise similarities have been computed. It will be shown that the proposed nonlinearity amounts to a higher-degree rational function to approximate softmax. An efficient computational scheme for the proposed method is derived that is similar to that of standard linear attention. In contrast to other approaches, no time-consuming tensor reshaping is necessary to apply the proposed algorithm. The effectiveness of the approach is demonstrated by applying it to a large diffusion transformer model for video generation, an application that involves very large amounts of tokens.

Related papers

Higher-order Linear Attention [59.92962330635185]
quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts.<n>We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics.
arXiv Detail & Related papers (2025-10-31T07:54:37Z)
CLAP: Concave Linear APproximation for Quadratic Graph Matching [5.417323487240968]
We introduce a linear model and designed a novel approximation matrix for graph matching. We then transform the original QAP into a linear model that is concave for the structural constraint. This model can be solved using the Sinkhorn optimal transport algorithm.
arXiv Detail & Related papers (2024-10-22T15:28:18Z)
Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization using cubics have proven to have some very appealing theoretical properties. We show that TR and ARC methods can simultaneously provide inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z)
Neural incomplete factorization: learning preconditioners for the conjugate gradient method [2.899792823251184]
We develop a data-driven approach to accelerate the generation of effective preconditioners. We replace the typically hand-engineered preconditioners by the output of graph neural networks. Our method generates an incomplete factorization of the matrix and is, therefore, referred to as neural incomplete factorization (NeuralIF)
arXiv Detail & Related papers (2023-05-25T11:45:46Z)
Learning distributed representations with efficient SoftMax normalization [3.8673630752805437]
We propose a linear-time approximation to compute the normalization constants of $rm SoftMax(XYT)$ for embedding vectors with bounded norms.<n>We show on some pre-trained embedding datasets that the proposed estimation method achieves higher or comparable accuracy with competing methods.<n>The proposed algorithm is interpretable and easily adapted to arbitrary embedding problems.
arXiv Detail & Related papers (2023-03-30T15:48:26Z)
Quantum Splines for Non-Linear Approximations [2.064612766965483]
We propose an efficient implementation of quantum splines for non-linear approximation. In particular, we first discuss possible parametrisations, and select the most convenient for exploiting the HHL algorithm.
arXiv Detail & Related papers (2023-03-09T17:21:11Z)
Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. Existing methods are either theoretically flawed or empirically ineffective for visual recognition. We propose a family of Softmax-Free Transformers (SOFT)
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention. Various attempts on approximating the self-attention with linear complexity have been made in Natural Language Processing. We identify that their limitations are rooted in keeping the softmax self-attention during approximations. For the first time, a softmax-free transformer or SOFT is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z)
Hybrid Trilinear and Bilinear Programming for Aligning Partially Overlapping Point Sets [85.71360365315128]
In many applications, we need algorithms which can align partially overlapping point sets are invariant to the corresponding corresponding RPM algorithm. We first show that the objective is a cubic bound function. We then utilize the convex envelopes of trilinear and bilinear monomial transformations to derive its lower bound. We next develop a branch-and-bound (BnB) algorithm which only branches over the transformation variables and runs efficiently.
arXiv Detail & Related papers (2021-01-19T04:24:23Z)
Efficient Learning of Generative Models via Finite-Difference Score Matching [111.55998083406134]
We present a generic strategy to efficiently approximate any-order directional derivative with finite difference. Our approximation only involves function evaluations, which can be executed in parallel, and no gradient computations.
arXiv Detail & Related papers (2020-07-07T10:05:01Z)
Efficient Alternating Least Squares Algorithms for Low Multilinear Rank Approximation of Tensors [6.308492837096872]
We propose a new class of truncated HOSVD algorithms based on alternating least squares (ALS) for efficiently computing the low multilinear rank approximation of tensors. ALS-based approaches are able to eliminate the redundant computations of the singular vectors of intermediate matrices and are therefore free of data explosion. Numerical experiments with large-scale tensors from both synthetic and real-world applications demonstrate that ALS-based methods can substantially reduce the total cost of the original ones and are highly scalable for parallel computing.
arXiv Detail & Related papers (2020-04-06T11:58:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.