Fast Gradient Computation for RoPE Attention in Almost Linear Time
- URL: http://arxiv.org/abs/2412.17316v2
- Date: Tue, 31 Dec 2024 06:53:40 GMT
- Title: Fast Gradient Computation for RoPE Attention in Almost Linear Time
- Authors: Yifang Chen, Jiayan Huo, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song
- Abstract summary: We develop the first almost linear time algorithm for backward computations in RoPE-based attention under bounded entries.
Our approach builds on recent advancements in fast RoPE attention computations.
- Score: 27.28314860714307
- Abstract: The Rotary Position Embedding (RoPE) mechanism has become a powerful enhancement to the Transformer architecture, enabling models to capture token relationships when encoding positional information. However, RoPE makes the attention computation more complicated, which makes designing efficient algorithms challenging. Earlier research introduced almost linear time algorithms, i.e., $n^{1+o(1)}$ where $n$ is the number of input tokens, for the forward computation under specific parameter settings. However, achieving a subquadratic time algorithm for other parameter regimes remains impossible unless the widely accepted Strong Exponential Time Hypothesis (SETH) is disproven. In this work, we develop the first almost linear time algorithm for backward computations in RoPE-based attention under bounded entries. Our approach builds on recent advancements in fast RoPE attention computations, utilizing a novel combination of the polynomial method and the Fast Fourier Transform. Furthermore, we show that with lower bounds derived from the SETH, the bounded entry condition is necessary for subquadratic performance.
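The abstract does not spell out why the FFT is applicable, but the key structural fact is standard: RoPE rotates each 2-D slice of a query or key vector by an angle proportional to its position, so the attention score of two embedded vectors depends only on their relative position. A minimal NumPy sketch of this property (illustrative only; the function name `rope` and base 10000 follow the usual RoPE construction, and this is not the paper's algorithm):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply Rotary Position Embedding: rotate each 2-D slice of x
    by an angle proportional to the token position `pos`."""
    d = x.shape[-1]
    assert d % 2 == 0
    # one rotation frequency per 2-D coordinate pair
    freqs = base ** (-np.arange(0, d, 2) / d)
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# The score <rope(q, m), rope(k, n)> depends only on m - n:
s1 = rope(q, 5) @ rope(k, 2)   # relative offset 3
s2 = rope(q, 9) @ rope(k, 6)   # relative offset 3
print(np.isclose(s1, s2))      # True
```

Because the score matrix is a function of the offset $m - n$, it has a convolution-like (Toeplitz) structure, which is the kind of structure FFT-based methods exploit; combining this with the polynomial method's low-rank approximation of the softmax is, per the abstract, the route to the $n^{1+o(1)}$ gradient computation.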
Related papers
- Alternating minimization for square root principal component pursuit [2.449191760736501]
We develop efficient algorithms for solving the square root principal component pursuit (SRPCP) problem.
Specifically, we propose a tuning-free alternating minimization (AltMin) algorithm, where each iteration involves subproblems enjoying closed-form optimal solutions.
We introduce techniques based on the variational formulation of the nuclear norm and Burer-Monteiro decomposition to further accelerate the AltMin method.
arXiv Detail & Related papers (2024-12-31T14:43:50Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization using cubics have proven to have some very appealing theoretical properties.
We show that TR and ARC methods retain their guarantees while allowing inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - Tractable Bounding of Counterfactual Queries by Knowledge Compilation [51.47174989680976]
We discuss the problem of bounding partially identifiable queries, such as counterfactuals, in Pearlian structural causal models.
A recently proposed iterated EM scheme yields an inner approximation of those bounds by sampling the initialisation parameters.
We show how a single symbolic knowledge compilation allows us to obtain the circuit structure with symbolic parameters to be replaced by their actual values.
arXiv Detail & Related papers (2023-10-05T07:10:40Z) - Approximative lookup-tables and arbitrary function rotations for facilitating NISQ-implementations of the HHL and beyond [6.1003703380200545]
We propose a circuit approximation technique that enhances the arithmetic subroutines in the HHL.
We show how these types of circuits can be reduced in depth by providing a simple and powerful approximation technique.
arXiv Detail & Related papers (2023-06-08T08:22:41Z) - Higher-order Derivatives of Weighted Finite-state Machines [68.43084108204741]
This work examines the computation of higher-order derivatives with respect to the normalization constant for weighted finite-state machines.
We provide a general algorithm for evaluating derivatives of all orders, which has not been previously described in the literature.
Our algorithm is significantly faster than prior algorithms.
arXiv Detail & Related papers (2021-06-01T19:51:55Z) - Sublinear Least-Squares Value Iteration via Locality Sensitive Hashing [49.73889315176884]
We present the first provable Least-Squares Value Iteration (LSVI) algorithms that have runtime complexity sublinear in the number of actions.
We build the connections between the theory of approximate maximum inner product search and the regret analysis of reinforcement learning.
arXiv Detail & Related papers (2021-05-18T05:23:53Z) - Rapid Robust Principal Component Analysis: CUR Accelerated Inexact Low Rank Estimation [8.169365031508885]
We propose a novel non-convex robust PCA algorithm, coined Iterated Robust CUR (IRCUR).
IRCUR is able to process only small submatrices and avoid expensive computing on the full matrix through the entire algorithm.
Numerical experiments establish the computational advantage of IRCUR on both synthetic and real-world datasets.
arXiv Detail & Related papers (2020-10-14T22:30:20Z) - Approximation Algorithms for Sparse Principal Component Analysis [57.5357874512594]
Principal component analysis (PCA) is a widely used dimension reduction technique in machine learning and statistics.
Various approaches to obtain sparse principal direction loadings have been proposed; these are termed Sparse Principal Component Analysis (SPCA).
We present thresholding as a provably accurate, polynomial-time approximation algorithm for the SPCA problem.
arXiv Detail & Related papers (2020-06-23T04:25:36Z) - Lagrangian Decomposition for Neural Network Verification [148.0448557991349]
A fundamental component of neural network verification is the computation of bounds on the values their outputs can take.
We propose a novel approach based on Lagrangian Decomposition.
We show that we obtain bounds comparable with off-the-shelf solvers in a fraction of their running time.
arXiv Detail & Related papers (2020-02-24T17:55:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.