Element-wise Attention Is All You Need
- URL: http://arxiv.org/abs/2501.05730v1
- Date: Fri, 10 Jan 2025 05:54:04 GMT
- Title: Element-wise Attention Is All You Need
- Authors: Guoxin Feng
- Abstract summary: The self-attention mechanism has superior performance across various domains, yet it suffers from substantial complexity during both training and inference.
We propose a novel element-wise attention mechanism, which uses the element-wise squared Euclidean distance, instead of the dot product operation, to compute similarity.
During inference, it can be reformulated as a recurrent neural network, achieving an inference complexity of $\mathcal{O}(tD)$.
- Score: 0.0
- Abstract: The self-attention (SA) mechanism has demonstrated superior performance across various domains, yet it suffers from substantial complexity during both training and inference. The next-generation architecture, aiming at retaining the competitive performance of SA while achieving low-cost inference and efficient long-sequence training, primarily focuses on three approaches: linear attention, linear RNNs, and state space models. Although these approaches achieve lower complexity than SA, they all have built-in performance degradation factors, such as diminished “spikiness” and compression of historical information. In contrast to these approaches, we propose a novel element-wise attention mechanism, which uses the element-wise squared Euclidean distance, instead of the dot product operation, to compute similarity and approximates the quadratic complexity term $\exp(q_{ic}k_{jc})$ with a Taylor polynomial. This design achieves remarkable efficiency: during training, the element-wise attention has a complexity of $\mathcal{O}(tLD)$, making long-sequence training both computationally and memory efficient, where $L$ is the sequence length, $D$ is the feature dimension, and $t$ is the highest order of the polynomial; during inference, it can be reformulated as recurrent neural networks, achieving an inference complexity of $\mathcal{O}(tD)$. Furthermore, the element-wise attention circumvents the performance degradation factors present in these approaches and achieves performance comparable to SA in both causal and non-causal forms.
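The abstract only sketches the mechanism, but one plausible reading of the $\mathcal{O}(tD)$ claim is the following: a per-channel similarity of the form $\exp(-(q_{ic}-k_{jc})^2)$ factors as $\exp(-q_{ic}^2)\exp(-k_{jc}^2)\exp(2q_{ic}k_{jc})$, and replacing the cross term with its order-$t$ Taylor polynomial $\sum_{n=0}^{t}(2q_{ic}k_{jc})^n/n!$ separates query and key terms, so key/value statistics can be accumulated channel-wise into a state of size $\mathcal{O}(tD)$. The NumPy sketch below illustrates this reading for the causal form; the $\exp(-(q-k)^2)$ kernel, the normalization, and all function names are assumptions for illustration, not the authors' reference implementation.

```python
# Minimal sketch (NOT the paper's reference code) of a causal, element-wise
# attention step run recurrently with O(t*D) state, under the assumptions
# stated above: similarity exp(-(q_ic - k_jc)^2) factored per channel, with
# the cross term exp(2*q_ic*k_jc) truncated to an order-t Taylor polynomial.
import math
import numpy as np

def elementwise_attention_recurrent(Q, K, V, t=2, eps=1e-6):
    """Q, K, V: float arrays of shape (L, D); t: highest Taylor order."""
    L, D = Q.shape
    # Taylor coefficients of exp(2x): 2^n / n!
    coeffs = np.array([2.0**n / math.factorial(n) for n in range(t + 1)])
    num_state = np.zeros((t + 1, D))  # running sum of exp(-k_c^2) * k_c^n * v_c
    den_state = np.zeros((t + 1, D))  # running sum of exp(-k_c^2) * k_c^n
    out = np.zeros((L, D))
    for i in range(L):
        q, k, v = Q[i], K[i], V[i]
        k_pow = np.stack([k**n for n in range(t + 1)])   # (t+1, D)
        w_k = np.exp(-k**2)                              # per-channel key weight
        num_state += w_k * k_pow * v                     # update O(t*D) state
        den_state += w_k * k_pow
        q_pow = np.stack([q**n for n in range(t + 1)])   # (t+1, D)
        # exp(-q_c^2) cancels between numerator and denominator, so it is omitted.
        num = (coeffs[:, None] * q_pow * num_state).sum(axis=0)
        den = (coeffs[:, None] * q_pow * den_state).sum(axis=0)
        out[i] = num / (den + eps)
    return out

# Toy usage: a sequence of length 8 with 4 channels.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
print(elementwise_attention_recurrent(Q, K, V, t=2).shape)  # (8, 4)
```

Under this reading, each step touches only the $\mathcal{O}(tD)$ state, matching the claimed inference cost, while evaluating all $L$ positions in parallel during training costs $\mathcal{O}(tLD)$.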
Related papers
- Enhanced Computationally Efficient Long LoRA Inspired Perceiver Architectures for Auto-Regressive Language Modeling [2.9228447484533695]
The Transformer architecture has revolutionized the Natural Language Processing field and is the backbone of Large Language Models (LLMs).
One of the challenges in the Transformer architecture is the quadratic complexity of the attention mechanism that prohibits the efficient processing of long sequence lengths.
One of the important works in this respect is the Perceiver class of architectures that have demonstrated excellent performance while reducing the computation complexity.
arXiv Detail & Related papers (2024-12-08T23:41:38Z) - A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning
with General Function Approximation [66.26739783789387]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for reinforcement learning.
MQL-UCB achieves a minimax-optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large, together with a near-optimal policy switching cost.
Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
arXiv Detail & Related papers (2023-11-26T08:31:57Z) - Scaling ResNets in the Large-depth Regime [11.374578778690623]
Deep ResNets are recognized for achieving state-of-the-art results in machine learning tasks.
Deep ResNets rely on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients.
No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$.
arXiv Detail & Related papers (2022-06-14T15:49:10Z) - Poly-NL: Linear Complexity Non-local Layers with Polynomials [76.21832434001759]
We formulate novel fast NonLocal blocks, capable of reducing complexity from quadratic to linear with no loss in performance.
The proposed method, which we dub "Poly-NL", is competitive with state-of-the-art performance across image recognition, instance segmentation, and face detection tasks.
arXiv Detail & Related papers (2021-07-06T19:51:37Z) - The Computational Complexity of ReLU Network Training Parameterized by
Data Dimensionality [8.940054309023525]
We analyze the influence of the dimension $d$ of the training data on the computational complexity.
We prove that known brute-force strategies are essentially optimal.
In particular, we extend a known algorithm for constant $d$ and convex loss functions to a more general class of loss functions.
arXiv Detail & Related papers (2021-05-18T17:05:26Z) - On Function Approximation in Reinforcement Learning: Optimism in the
Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $\mathcal{F}$ characterizes the complexity of the learning problem.
Our regret bounds are independent of the number of episodes.
arXiv Detail & Related papers (2020-11-09T18:32:22Z) - The Complexity of Adversarially Robust Proper Learning of Halfspaces
with Agnostic Noise [67.27523616312428]
We study the computational complexity of adversarially robust proper learning of halfspaces in the distribution-independent PAC model.
We give a computationally efficient learning algorithm and a nearly matching computational hardness result for this problem.
arXiv Detail & Related papers (2020-07-30T04:18:51Z) - Sample-Efficient Reinforcement Learning of Undercomplete POMDPs [91.40308354344505]
This work shows that these hardness barriers do not preclude efficient reinforcement learning for rich and interesting subclasses of Partially Observable Markov Decision Processes (POMDPs).
We present a sample-efficient algorithm, OOM-UCB, for episodic finite undercomplete POMDPs, where the number of observations is larger than the number of latent states and where exploration is essential for learning, thus distinguishing our results from prior works.
arXiv Detail & Related papers (2020-06-22T17:58:54Z) - Convergence of Meta-Learning with Task-Specific Adaptation over Partial
Parameters [152.03852111442114]
Although model-agnostic meta-learning (MAML) is a very successful algorithm in meta-learning practice, it can have high computational complexity.
Our paper shows that such complexity can significantly affect the overall convergence performance of ANIL (Almost No Inner Loop), a MAML variant that adapts only a subset of the parameters.
arXiv Detail & Related papers (2020-06-16T19:57:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.