Fugu-MT 論文翻訳(概要): One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space

論文の概要: One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space

arxiv url: http://arxiv.org/abs/2311.14652v2
Date: Mon, 5 Feb 2024 18:30:30 GMT
ステータス: 翻訳完了
システム内更新日: 2024-02-07 03:53:20.267847
Title: One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space
Title（参考訳）: サブリニア空間における超長期トークン注意近似のためのワンパスストリーミングアルゴリズム
Authors: Raghav Addanki, Chenyang Li, Zhao Song, Chiwun Yang
Abstract要約: 注意計算は、$O(n2)$の時間複雑性と$O(n2)$の空間複雑性を同時に行う。ストリーミング方式で1パスのデータのみを読み取る新しいアルゴリズムを導入する。特に,本アルゴリズムは,超長期トークンを用いたメモリ効率の優れた性能を示す。
参考スコア（独自算出の注目度）: 11.735802740426294
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Attention computation takes both the time complexity of $O(n^2)$ and the space complexity of $O(n^2)$ simultaneously, which makes deploying Large Language Models (LLMs) in streaming applications that involve long contexts requiring substantial computational resources. In recent OpenAI DevDay (Nov 6, 2023), OpenAI released a new model that is able to support a 128K-long document, in our paper, we focus on the memory-efficient issue when context length $n$ is much greater than 128K ($n \gg 2^d$). Considering a single-layer self-attention with Query, Key, and Value matrices $Q, K, V \in \mathbb{R}^{n \times d}$, the polynomial method approximates the attention output $T \in \mathbb{R}^{n \times d}$. It accomplishes this by constructing $U_1, U_2 \in \mathbb{R}^{n \times t}$ to expedite attention ${\sf Attn}(Q, K, V)$ computation within $n^{1+o(1)}$ time executions. Despite this, computing the approximated attention matrix $U_1U_2^\top \in \mathbb{R}^{n \times n}$ still necessitates $O(n^2)$ space, leading to significant memory usage. In response to these challenges, we introduce a new algorithm that only reads one pass of the data in a streaming fashion. This method employs sublinear space $o(n)$ to store three sketch matrices, alleviating the need for exact $K, V$ storage. Notably, our algorithm exhibits exceptional memory-efficient performance with super-long tokens. As the token length $n$ increases, our error guarantee diminishes while the memory usage remains nearly constant. This unique attribute underscores the potential of our technique in efficiently handling LLMs in streaming applications.
Abstract（参考訳）: 注意計算は、$O(n^2)$の時間的複雑さと$O(n^2)$の空間的複雑さの両方を同時に取り、膨大な計算資源を必要とする長いコンテキストを含むストリーミングアプリケーションに大規模言語モデル(LLM)をデプロイする。先日のOpenAI DevDay (2023年2月6日)で、OpenAIは、128Kのドキュメントをサポート可能な新しいモデルをリリースした。 Query, Key, and Value 行列 $Q, K, V \in \mathbb{R}^{n \times d}$ の単層自己アテンションを考えると、多項式法は注意出力 $T \in \mathbb{R}^{n \times d}$ を近似する。 u_1, u_2 \in \mathbb{r}^{n \times t}$ to expedite attention ${\sf attn}(q, k, v)$ computation within $n^{1+o(1)}$ time executions. でこれを実現している。それにもかかわらず、近似された注目行列$U_1U_2^\top \in \mathbb{R}^{n \times n}$はまだ$O(n^2)$空間を必要とするため、メモリ使用量は大幅に増加する。これらの課題に対応して,ストリーミング方式でデータの1パスのみを読み取る新しいアルゴリズムを導入する。この方法は3つのスケッチ行列を格納するためにサブ線形空間$o(n)$を使用し、正確な$K、V$ストレージの必要性を緩和する。特に,超長トークンを用いたメモリ効率の優れた性能を示す。トークン長の$n$が増加すると、メモリ使用量がほぼ一定である間、エラー保証は減少します。このユニークな属性は、ストリーミングアプリケーションにおけるLLMの効率的な処理における我々の技術の可能性を示している。

関連論文リスト

Differentially Private Space-Efficient Algorithms for Counting Distinct Elements in the Turnstile Model [61.40326886123332]
ターンタイルストリーミングモデルにおいて、異なる要素を数えるという根本的な問題に対して、最初のサブ線形空間を微分プライベートなアルゴリズムを与える。本結果は, 線形問題である最先端アルゴリズムの空間要求を著しく改善する。ストリームにアイテムが現れる回数の制限付き$W$が分かっている場合、我々のアルゴリズムは$tildeO_eta(sqrtW)$ space.sqrtW)$ additive errorを提供する。
論文参考訳（メタデータ） (2025-05-29T17:21:20Z)
HSR-Enhanced Sparse Attention Acceleration [19.776342074253435]
大規模言語モデル(LLM)における注意計算を高速化する新しい手法を提案する。我々は,従来のSoftmaxアテンションとReLUアテンションの両方において,アテンションメカニズム内の固有空間を利用する。提案手法は,Softmaxの注意を確実に無視できる誤差を導入するのみである。
論文参考訳（メタデータ） (2024-10-14T05:18:02Z)
LevAttention: Time, Space, and Streaming Efficient Algorithm for Heavy Attentions [54.54897832889028]
任意の$K$に対して、$n$とは独立に「普遍集合」$Uサブセット[n]$が存在し、任意の$Q$と任意の行$i$に対して、大きな注目スコアが$A_i,j$ in row $i$ of $A$は全て$jin U$を持つことを示す。我々は、視覚変換器のスキームの利点を実証的に示し、トレーニング中に我々の普遍的なセットを使用する新しいモデルのトレーニング方法を示した。
論文参考訳（メタデータ） (2024-10-07T19:47:13Z)
Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
線形スケッチを用いた行列とベクトルノルムの残差誤差推定問題について検討する。これは、前作とほぼ同じスケッチサイズと精度で、経験的にかなり有利であることを示す。また、スパースリカバリ問題に対して$Omega(k2/pn1-2/p)$低いバウンダリを示し、これは$mathrmpoly(log n)$ factorまで厳密である。
論文参考訳（メタデータ） (2024-08-16T02:33:07Z)
How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse [9.552839922307587]
スパース注意(英: Sparse Attention)とは、標準的な注意計算と準四分法的な複雑性を近似する手法である。 KVキャッシュのプルーニング、スパースベースの高速注意、スパーストランスフォーマーといったテクニックのバリエーションは、効率的なLLM(Large Language Models)デプロイメントに広く利用されている。
論文参考訳（メタデータ） (2024-04-03T12:37:34Z)
Fast $(1+\varepsilon)$-Approximation Algorithms for Binary Matrix Factorization [54.29685789885059]
本稿では, 2次行列分解(BMF)問題に対する効率的な$(1+varepsilon)$-approximationアルゴリズムを提案する。目標は、低ランク因子の積として$mathbfA$を近似することである。我々の手法はBMF問題の他の一般的な変種に一般化する。
論文参考訳（メタデータ） (2023-06-02T18:55:27Z)
Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension [18.57735939471469]
我々は注意問題のスパシフィケーションを考慮する。超大規模特徴量の場合、文の長さをほぼ線形に縮めることができる。
論文参考訳（メタデータ） (2023-04-10T05:52:38Z)
Fast Attention Requires Bounded Entries [19.17278873525312]
内部製品注意計算はTransformer, GPT-1, BERT, GPT-2, GPT-3, ChatGPTなどの大規模言語モデルを訓練するための基本的なタスクである。行列を暗黙的に$A$とすることで、より高速なアルゴリズムが可能かどうかを検討する。このことは、入力行列がより小さいエントリを持つ場合、注意計算の方がはるかに効率的である、実際に観察された現象の理論的な説明を与える。
論文参考訳（メタデータ） (2023-02-26T02:42:39Z)
Sketching Algorithms and Lower Bounds for Ridge Regression [65.0720777731368]
リッジ回帰問題に対する1+varepsilon$近似解を計算するスケッチベース反復アルゴリズムを提案する。また,このアルゴリズムがカーネルリッジ回帰の高速化に有効であることを示す。
論文参考訳（メタデータ） (2022-04-13T22:18:47Z)
Learning a Latent Simplex in Input-Sparsity Time [58.30321592603066]
我々は、$AinmathbbRdtimes n$へのアクセスを考えると、潜入$k$-vertex simplex $KsubsetmathbbRdtimes n$を学習する問題を考える。実行時間における$k$への依存は、トップ$k$特異値の質量が$a$であるという自然な仮定から不要であることを示す。
論文参考訳（メタデータ） (2021-05-17T16:40:48Z)
Approximate Multiplication of Sparse Matrices with Limited Space [24.517908972536432]
我々はスパース共起方向を開発し、期待値の$widetildeOleft((nnz(X)+nnz(Y))ell+nell2right)$に時間複雑性を減少させる。理論的解析により,我々のアルゴリズムの近似誤差はCODとほぼ同値であることが判明した。
論文参考訳（メタデータ） (2020-09-08T05:39:19Z)
Streaming Complexity of SVMs [110.63976030971106]
本稿では,ストリーミングモデルにおけるバイアス正規化SVM問題を解く際の空間複雑性について検討する。両方の問題に対して、$frac1lambdaepsilon$の次元に対して、$frac1lambdaepsilon$よりも空間的に小さいストリーミングアルゴリズムを得ることができることを示す。
論文参考訳（メタデータ） (2020-07-07T17:10:00Z)
An Improved Cutting Plane Method for Convex Optimization, Convex-Concave Games and its Applications [28.54236118020831]
本稿では,最適な$O(n log (kappa))$オアラーク評価を用いた新しい切削平面アルゴリズムを提案する。また、評価毎に$n2$の時間を改善できないため、実行時間が最適であることを示す。
論文参考訳（メタデータ） (2020-04-08T20:56:40Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。