Fugu-MT Paper Translation (Abstract): Max-Margin Token Selection in Attention Mechanism

Paper overview: Max-Margin Token Selection in Attention Mechanism

arxiv url: http://arxiv.org/abs/2306.13596v2
Date: Tue, 27 Jun 2023 08:28:40 GMT
Status: Translation complete
Last updated in system: 2023-06-28 10:17:20.359917
Title: Max-Margin Token Selection in Attention Mechanism
Title (reference translation): Max-Margin Token Selection in the Attention Mechanism
Authors: Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, Samet Oymak
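Abstract summary: We prove that running gradient descent on $\boldsymbol{p}$, or equivalently $\boldsymbol{W}$, converges in direction to a max-margin solution. Remarkably, our results apply to general data and precisely characterize the optimality of tokens.
Reference score (attention score computed independently by this site): 28.406996136801006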
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Attention mechanism is a central component of the transformer architecture which led to the phenomenal success of large language models. However, the theoretical principles underlying the attention mechanism are poorly understood, especially its nonconvex optimization dynamics. In this work, we explore the seminal softmax-attention model $f(\boldsymbol{X})=\langle \boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$, where $\boldsymbol{X}$ is the token sequence and $(\boldsymbol{v},\boldsymbol{W},\boldsymbol{p})$ are trainable parameters. We prove that running gradient descent on $\boldsymbol{p}$, or equivalently $\boldsymbol{W}$, converges in direction to a max-margin solution that separates $\textit{locally-optimal}$ tokens from non-optimal ones. This clearly formalizes attention as an optimal token selection mechanism. Remarkably, our results are applicable to general data and precisely characterize $\textit{optimality}$ of tokens in terms of the value embeddings $\boldsymbol{Xv}$ and problem geometry. We also provide a broader regularization path analysis that establishes the margin maximizing nature of attention even for nonlinear prediction heads. When optimizing $\boldsymbol{v}$ and $\boldsymbol{p}$ simultaneously with logistic loss, we identify conditions under which the regularization paths directionally converge to their respective hard-margin SVM solutions where $\boldsymbol{v}$ separates the input features based on their labels. Interestingly, the SVM formulation of $\boldsymbol{p}$ is influenced by the support vector geometry of $\boldsymbol{v}$. Finally, we verify our theoretical findings via numerical experiments and provide insights.
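Abstract (reference translation): The attention mechanism is a central component of the transformer architecture and has led to the remarkable success of large language models. However, the theoretical principles underlying the attention mechanism are poorly understood, in particular its nonconvex optimization dynamics. In this work, we study the seminal softmax-attention model $f(\boldsymbol{X})=\langle \boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$, where $\boldsymbol{X}$ is the token sequence and $(\boldsymbol{v},\boldsymbol{W},\boldsymbol{p})$ are trainable parameters. We prove that gradient descent on $\boldsymbol{p}$, or equivalently $\boldsymbol{W}$, converges in direction to a max-margin solution that separates $\textit{locally-optimal}$ tokens from non-optimal ones. This clearly formalizes attention as an optimal token selection mechanism. Remarkably, our results apply to general data and precisely characterize the $\textit{optimality}$ of tokens in terms of the value embeddings $\boldsymbol{Xv}$ and the problem geometry. We also provide a broader regularization path analysis that establishes the margin-maximizing nature of attention even for nonlinear prediction heads. When $\boldsymbol{v}$ and $\boldsymbol{p}$ are optimized simultaneously with the logistic loss, we identify conditions under which the regularization paths converge in direction to their respective hard-margin SVM solutions, where $\boldsymbol{v}$ separates the input features based on their labels. Interestingly, the SVM formulation of $\boldsymbol{p}$ is influenced by the support vector geometry of $\boldsymbol{v}$. Finally, we verify our theoretical findings with numerical experiments and provide insights.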
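The abstract's central claim can be illustrated with a small numerical sketch. The NumPy code below implements the model $f(\boldsymbol{X})=\langle \boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$ for a single sequence and runs gradient descent on $\boldsymbol{p}$ only, with $\boldsymbol{v}$ and $\boldsymbol{W}$ held fixed. Everything specific here is an assumption made for illustration and is not taken from the paper: the three toy tokens, the choice $\boldsymbol{W}=\boldsymbol{I}$, the label $y=1$, the logistic loss, the step size, the step count, and the helper names `attention_output`, `grad_p`, and `run_gd`. For this toy, the max-margin problem (minimize $\|\boldsymbol{p}\|$ subject to $(\boldsymbol{x}_0-\boldsymbol{x}_t)^\top \boldsymbol{W}\boldsymbol{p}\ge 1$ for the non-optimal tokens $t$) reduces to $p_1-p_2\ge 1$ and $p_1+0.5\,p_2\ge 1$, whose minimum-norm solution is $(1, 0)$; that direction is hard-coded for comparison.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_output(X, v, W, p):
    """Return f(X) = <Xv, softmax(XWp)> and the attention weights for one sequence X of shape (T, d)."""
    scores = X @ v              # value embeddings Xv: one scalar score per token
    attn = softmax(X @ W @ p)   # attention weights over the T tokens
    return scores @ attn, attn

def grad_p(X, v, W, p, y):
    """Gradient of the logistic loss log(1 + exp(-y * f(X))) with respect to p."""
    scores = X @ v
    f, attn = attention_output(X, v, W, p)
    # d f / d logits = (diag(attn) - attn attn^T) scores = attn * (scores - f)
    df_dp = (X @ W).T @ (attn * (scores - f))
    dloss_df = -y / (1.0 + np.exp(y * f))
    return dloss_df * df_dp

# Hypothetical toy data: token 0 has the highest score, tokens 1 and 2 tie at zero.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, -0.5]])
v = np.array([1.0, 0.0])   # fixed value vector, so Xv = [1, 0, 0]
W = np.eye(2)              # fixed key-query matrix (identity, an assumption)
y = 1.0                    # label of the single training sequence

def run_gd(steps=5000, lr=1.0):
    """Plain gradient descent on p, starting from zero."""
    p = np.zeros(2)
    for _ in range(steps):
        p = p - lr * grad_p(X, v, W, p, y)
    return p

p = run_gd()
_, attn = attention_output(X, v, W, p)

# Hand-derived max-margin direction for this toy (see the lead-in above).
p_mm = np.array([1.0, 0.0])
cosine = p @ p_mm / (np.linalg.norm(p) * np.linalg.norm(p_mm))

print("p after training:     ", p)
print("norm of p:            ", np.linalg.norm(p))
print("attention weights:    ", attn)
print("cosine to max-margin: ", cosine)
```

In this sketch the norm of $\boldsymbol{p}$ keeps growing while its direction settles, which is why the result is stated as convergence in direction: the softmax saturates on the highest-scoring token, and only the normalized vector $\boldsymbol{p}/\|\boldsymbol{p}\|$ is compared with the max-margin solution.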

Related papers

Attention with Trained Embeddings Provably Selects Important Tokens [73.77633297039097]
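Token embeddings play a crucial role in language modeling, but despite this practical relevance, their theoretical understanding remains limited. This paper closes the gap by characterizing the structure of embeddings obtained via gradient descent. Experiments on real-world datasets (IMDB, Yelp) show phenomena close to those revealed by our theory.
Paper reference translation (metadata) (2025-05-22T21:00:09Z)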
Fast Debiasing of the LASSO Estimator [3.554868356768806]
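In high-dimensional sparse regression, the \textsc{Lasso} estimator offers excellent theoretical guarantees but is well known to produce biased estimates. A "debiasing" method is introduced for \textsc{Lasso} estimates with a random sub-Gaussian sensing matrix $\boldsymbol{A}$.
Paper reference translation (metadata) (2025-02-27T06:59:17Z)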
Solving Quadratic Systems with Full-Rank Matrices Using Sparse or Generative Priors [33.0212223058894]
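The problem of recovering a signal from a quadratic system $y_i=\boldsymbol{x}^\top\boldsymbol{A}_i\boldsymbol{x},\ i=1,\ldots,m$ with full-rank matrices $\boldsymbol{A}_i$ arises frequently in applications such as unassigned distance geometry and sub-wavelength imaging. This paper addresses the high-dimensional case $m\ll n$ by incorporating prior knowledge of $\boldsymbol{x}$.
Paper reference translation (metadata) (2023-09-16T16:00:07Z)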
Transformers as Support Vector Machines [54.642793677472724]
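We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem. We characterize the implicit bias of one-layer transformers optimized with gradient descent. We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
Paper reference translation (metadata) (2023-08-31T17:57:50Z)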
High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation [89.21686761957383]
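We study a gradient descent step on the first-layer parameters $\boldsymbol{W}$ of a two-layer network. Our results show that even a single step can yield a considerable advantage over random features.
Paper reference translation (metadata) (2022-05-03T12:09:59Z)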
Approximate Function Evaluation via Multi-Armed Bandits [51.146684847667125]
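We study the problem of estimating the value of a known smooth function $f$ at an unknown point $\boldsymbol{\mu} \in \mathbb{R}^n$. We design an instance-adaptive algorithm that learns to sample according to the importance of each coordinate and returns, with probability at least $1-\delta$, an $\epsilon$-accurate estimate of $f(\boldsymbol{\mu})$.
Paper reference translation (metadata) (2022-03-18T18:50:52Z)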
On the Self-Penalization Phenomenon in Feature Selection [69.16452769334367]
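We describe an implicit sparsity-inducing mechanism based on a family of kernels. As an application, we use this sparsity-inducing mechanism to build an algorithm that is consistent for feature selection.
Paper reference translation (metadata) (2021-10-12T09:36:41Z)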
Self-training Converts Weak Learners to Strong Learners in Mixture Models [86.7137362125503]
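We show that a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$. Furthermore, we show that by running gradient descent on the logistic loss, a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ with classification error $C_{\mathrm{err}}$ can be obtained using only labeled examples.
Paper reference translation (metadata) (2021-06-25T17:59:16Z)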
Extensions to the Proximal Distance Method of Constrained Optimization [7.813460653362097]
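We study problems in which a loss $f(\boldsymbol{x})$ is subject to constraints of the form $\boldsymbol{x} \in S$. Fusion constraints can capture smoothness, sparsity, or more general constraint patterns.
Paper reference translation (metadata) (2020-09-02T03:32:41Z)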
Optimal Combination of Linear and Spectral Estimators for Generalized Linear Models [59.015960528781115]
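We show how to optimally combine $\hat{\boldsymbol{x}}^{\rm L}$ and $\hat{\boldsymbol{x}}^{\rm s}$. To establish the limiting distribution of $(\boldsymbol{x}, \hat{\boldsymbol{x}}^{\rm L}, \hat{\boldsymbol{x}}^{\rm s})$, we design and analyze an Approximate Message Passing (AMP) algorithm.
Paper reference translation (metadata) (2020-08-07T18:20:05Z)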
Pareto Active Learning with Gaussian Processes and Adaptive Discretization [12.179548969182573]
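We propose an algorithm that exploits the smoothness of the GP-sampled function and the structure of $({\cal X},d)$ to learn fast. In essence, Adaptive $\boldsymbol{\epsilon}$-PAL uses a tree-based adaptive discretization technique to identify an $\boldsymbol{\epsilon}$-accurate Pareto set of designs.
Paper reference translation (metadata) (2020-06-24T21:27:27Z)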
The generalization error of max-margin linear classifiers: Benign overfitting and high dimensional asymptotics in the overparametrized regime [11.252856459394854]
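Modern machine learning classifiers often exhibit vanishing classification error on the training set. Motivated by these phenomena, we revisit high-dimensional maximum-margin classification for linearly separable data.
Paper reference translation (metadata) (2019-11-05T00:15:27Z)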

The related papers list is generated automatically from the titles and abstracts of papers on this site.

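This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from the use of this site (including all information and translations).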