Fugu-MT 論文翻訳(概要): Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension

論文の概要: Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension

arxiv url: http://arxiv.org/abs/2304.04397v1
Date: Mon, 10 Apr 2023 05:52:38 GMT
ステータス: 翻訳完了
システム内更新日: 2023-04-11 15:57:20.660299
Title: Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension
Title（参考訳）: 過パラメータ化特徴次元に対するランダム化および決定論的アテンションスペーシフィケーションアルゴリズム
Authors: Yichuan Deng, Sridhar Mahadevan, Zhao Song
Abstract要約: 我々は注意問題のスパシフィケーションを考慮する。超大規模特徴量の場合、文の長さをほぼ線形に縮めることができる。
参考スコア（独自算出の注目度）: 18.57735939471469
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large language models (LLMs) have shown their power in different areas. Attention computation, as an important subroutine of LLMs, has also attracted interests in theory. Recently the static computation and dynamic maintenance of attention matrix has been studied by [Alman and Song 2023] and [Brand, Song and Zhou 2023] from both algorithmic perspective and hardness perspective. In this work, we consider the sparsification of the attention problem. We make one simplification which is the logit matrix is symmetric. Let $n$ denote the length of sentence, let $d$ denote the embedding dimension. Given a matrix $X \in \mathbb{R}^{n \times d}$, suppose $d \gg n$ and $\| X X^\top \|_{\infty} < r$ with $r \in (0,0.1)$, then we aim for finding $Y \in \mathbb{R}^{n \times m}$ (where $m\ll d$) such that \begin{align*} \| D(Y)^{-1} \exp( Y Y^\top ) - D(X)^{-1} \exp( X X^\top) \|_{\infty} \leq O(r) \end{align*} We provide two results for this problem. $\bullet$ Our first result is a randomized algorithm. It runs in $\widetilde{O}(\mathrm{nnz}(X) + n^{\omega} ) $ time, has $1-\delta$ succeed probability, and chooses $m = O(n \log(n/\delta))$. Here $\mathrm{nnz}(X)$ denotes the number of non-zero entries in $X$. We use $\omega$ to denote the exponent of matrix multiplication. Currently $\omega \approx 2.373$. $\bullet$ Our second result is a deterministic algorithm. It runs in $\widetilde{O}(\min\{\sum_{i\in[d]}\mathrm{nnz}(X_i)^2, dn^{\omega-1}\} + n^{\omega+1})$ time and chooses $m = O(n)$. Here $X_i$ denote the $i$-th column of matrix $X$. Our main findings have the following implication for applied LLMs task: for any super large feature dimension, we can reduce it down to the size nearly linear in length of sentence.
Abstract（参考訳）: 大規模言語モデル (LLM) は様々な分野でその能力を示している。 LLMの重要なサブルーチンとしての注意計算も理論に興味を惹きつけている。近年,[Alman and Song 2023] と [Brand, Song, Zhou 2023] による注目行列の静的計算と動的維持について,アルゴリズム的視点と硬さの観点から検討している。本稿では,注意問題のスパース化について考察する。ロジット行列が対称である1つの単純化を行う。 n$ は文の長さを表し、$d$ は埋め込み次元を表す。行列 $x \in \mathbb{r}^{n \times d}$ を与えられたとき、$d \gg n$ と $\| x x^\top \|_{\infty} < r$ with $r \in (0,0.1)$ と仮定すると、我々は$y \in \mathbb{r}^{n \times m}$ (ここで$m\ll d$) を \begin{align*} \| d(y)^{-1} \exp(y y^\top )d(x)^{-1} \exp(x^\top) \|_{\infty} \leq o(r) \end{align*} として求める。 $\bullet$ 最初の結果はランダム化アルゴリズムです。これは$\widetilde{o}(\mathrm{nnz}(x) + n^{\omega} ) $ time で動作し、1-\delta$ succeed 確率を持ち、$m = o(n \log(n/\delta))$ を選択する。ここで $\mathrm{nnz}(X)$ は$X$ の 0 でないエントリの数を表す。行列乗算の指数を表すために$\omega$を使用する。現在$\omega \approx 2.373$である。 $\bullet$ 2番目の結果は決定論的アルゴリズムです。これは$\widetilde{O}(\min\{\sum_{i\in[d]}\mathrm{nnz}(X_i)^2, dn^{\omega-1}\} + n^{\omega+1})$ timeで実行され、$m = O(n)$を選択する。ここで、$X_i$は行列の$i$-番目の列を表す。本研究の主目的は, LLMs タスクを適用した場合, 超大容量の特徴量に対して, 文の長さがほぼ直線的なサイズに縮小できる点である。

関連論文リスト

The Communication Complexity of Approximating Matrix Rank [50.6867896228563]
この問題は通信複雑性のランダム化を$Omega(frac1kcdot n2log|mathbbF|)$とする。アプリケーションとして、$k$パスを持つ任意のストリーミングアルゴリズムに対して、$Omega(frac1kcdot n2log|mathbbF|)$スペースローバウンドを得る。
論文参考訳（メタデータ） (2024-10-26T06:21:42Z)
Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
線形スケッチを用いた行列とベクトルノルムの残差誤差推定問題について検討する。これは、前作とほぼ同じスケッチサイズと精度で、経験的にかなり有利であることを示す。また、スパースリカバリ問題に対して$Omega(k2/pn1-2/p)$低いバウンダリを示し、これは$mathrmpoly(log n)$ factorまで厳密である。
論文参考訳（メタデータ） (2024-08-16T02:33:07Z)
A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time [7.613259578185218]
我々は、一層注意ネットワーク目的関数 $L(X,Y) の証明可能な保証を提供することに注力する。多層LCMネットワークでは、mathbbRn×d2$の行列$Bを層の出力と見なすことができる。損失関数をトレーニングする反復アルゴリズムを$L(X,Y)$ up $epsilon$で、$widetildeO( (cal T_mathrmmat(n,d) + dで実行される。
論文参考訳（メタデータ） (2023-09-14T04:23:40Z)
Matrix Completion in Almost-Verification Time [37.61139884826181]
99%の行と列で$mathbfM$を完了するアルゴリズムを提供する。本稿では,この部分完備保証を完全行列補完アルゴリズムに拡張する方法を示す。
論文参考訳（メタデータ） (2023-08-07T15:24:49Z)
Fast $(1+\varepsilon)$-Approximation Algorithms for Binary Matrix Factorization [54.29685789885059]
本稿では, 2次行列分解(BMF)問題に対する効率的な$(1+varepsilon)$-approximationアルゴリズムを提案する。目標は、低ランク因子の積として$mathbfA$を近似することである。我々の手法はBMF問題の他の一般的な変種に一般化する。
論文参考訳（メタデータ） (2023-06-02T18:55:27Z)
Fast Attention Requires Bounded Entries [19.17278873525312]
内部製品注意計算はTransformer, GPT-1, BERT, GPT-2, GPT-3, ChatGPTなどの大規模言語モデルを訓練するための基本的なタスクである。行列を暗黙的に$A$とすることで、より高速なアルゴリズムが可能かどうかを検討する。このことは、入力行列がより小さいエントリを持つ場合、注意計算の方がはるかに効率的である、実際に観察された現象の理論的な説明を与える。
論文参考訳（メタデータ） (2023-02-26T02:42:39Z)
A Nearly-Optimal Bound for Fast Regression with $\ell_\infty$ Guarantee [16.409210914237086]
行列 $Ain mathbbRntimes d$ とテンソル $bin mathbbRn$ が与えられたとき、 $ell_infty$ の回帰問題を考える。このような$ell_infty$レグレッションの保証を得るためには、濃密なスケッチ行列を使わなければならない。我々はまた、OCE(Oblivious Coordinate-wise Embedding)特性を利用した $ell_infty$ guarantee regression のための新しい分析フレームワークを開発した。
論文参考訳（メタデータ） (2023-02-01T05:22:40Z)
Sketching Algorithms and Lower Bounds for Ridge Regression [65.0720777731368]
リッジ回帰問題に対する1+varepsilon$近似解を計算するスケッチベース反復アルゴリズムを提案する。また,このアルゴリズムがカーネルリッジ回帰の高速化に有効であることを示す。
論文参考訳（メタデータ） (2022-04-13T22:18:47Z)
Learning a Latent Simplex in Input-Sparsity Time [58.30321592603066]
我々は、$AinmathbbRdtimes n$へのアクセスを考えると、潜入$k$-vertex simplex $KsubsetmathbbRdtimes n$を学習する問題を考える。実行時間における$k$への依存は、トップ$k$特異値の質量が$a$であるという自然な仮定から不要であることを示す。
論文参考訳（メタデータ） (2021-05-17T16:40:48Z)
The Average-Case Time Complexity of Certifying the Restricted Isometry Property [66.65353643599899]
圧縮センシングにおいて、100万倍のN$センシング行列上の制限等尺性(RIP)はスパースベクトルの効率的な再構成を保証する。 Mtimes N$ matrices with i.d.$mathcalN(0,1/M)$ entry。
論文参考訳（メタデータ） (2020-05-22T16:55:01Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。