Fugu-MT paper translation (abstract): Uncovering hidden geometry in Transformers via disentangling position and context

Paper abstract: Uncovering hidden geometry in Transformers via disentangling position and context

arxiv url: http://arxiv.org/abs/2310.04861v1
Date: Sat, 7 Oct 2023 15:50:26 GMT
Status: Translation complete
Last updated in system: 2023-10-12 14:56:20.425757
Title: Uncovering hidden geometry in Transformers via disentangling position and context
Authors: Jiajun Song and Yiqiao Zhong
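Abstract summary: We present a simple decomposition of the hidden states (or embeddings) of trained transformers into interpretable components. For popular transformer architectures and diverse text datasets, we empirically find pervasive mathematical structure.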
Reference score (attention score computed independently by this site): 0.6118897979046375
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformers are widely used to extract complex semantic meanings from input tokens, yet they usually operate as black-box models. In this paper, we present a simple yet informative decomposition of hidden states (or embeddings) of trained transformers into interpretable components. For any layer, embedding vectors of input sequence samples are represented by a tensor $\boldsymbol{h} \in \mathbb{R}^{C \times T \times d}$. Given embedding vector $\boldsymbol{h}_{c,t} \in \mathbb{R}^d$ at sequence position $t \le T$ in a sequence (or context) $c \le C$, extracting the mean effects yields the decomposition \[ \boldsymbol{h}_{c,t} = \boldsymbol{\mu} + \mathbf{pos}_t + \mathbf{ctx}_c + \mathbf{resid}_{c,t} \] where $\boldsymbol{\mu}$ is the global mean vector, $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are the mean vectors across contexts and across positions respectively, and $\mathbf{resid}_{c,t}$ is the residual vector. For popular transformer architectures and diverse text datasets, empirically we find pervasive mathematical structure: (1) $(\mathbf{pos}_t)_{t}$ forms a low-dimensional, continuous, and often spiral shape across layers, (2) $(\mathbf{ctx}_c)_c$ shows clear cluster structure that falls into context topics, and (3) $(\mathbf{pos}_t)_{t}$ and $(\mathbf{ctx}_c)_c$ are mutually incoherent -- namely $\mathbf{pos}_t$ is almost orthogonal to $\mathbf{ctx}_c$ -- which is canonical in compressed sensing and dictionary learning. This decomposition offers structural insights about input formats in in-context learning (especially for induction heads) and in arithmetic tasks.
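The decomposition in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `decompose` and `max_abs_cosine` are hypothetical, the input is synthetic random data rather than real transformer hidden states, and it assumes that $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are centered by subtracting $\boldsymbol{\mu}$ (which the identity $\boldsymbol{h}_{c,t} = \boldsymbol{\mu} + \mathbf{pos}_t + \mathbf{ctx}_c + \mathbf{resid}_{c,t}$ requires, but the abstract does not spell out). The maximum absolute cosine similarity is used here as one simple proxy for mutual incoherence.

```python
import numpy as np

def decompose(h):
    """Split h of shape (C, T, d) into mu, pos, ctx, resid.

    Assumption: pos and ctx are centered means (global mean removed),
    so that h[c, t] == mu + pos[t] + ctx[c] + resid[c, t] holds exactly.
    """
    mu = h.mean(axis=(0, 1))      # global mean vector, shape (d,)
    pos = h.mean(axis=0) - mu     # mean across contexts, shape (T, d)
    ctx = h.mean(axis=1) - mu     # mean across positions, shape (C, d)
    resid = h - mu - pos[None, :, :] - ctx[:, None, :]
    return mu, pos, ctx, resid

def max_abs_cosine(pos, ctx, eps=1e-12):
    """Largest |cosine similarity| between any pos_t and any ctx_c."""
    p = pos / (np.linalg.norm(pos, axis=1, keepdims=True) + eps)
    q = ctx / (np.linalg.norm(ctx, axis=1, keepdims=True) + eps)
    return float(np.abs(p @ q.T).max())

# Synthetic stand-in for one layer's hidden states: C=8 contexts,
# T=16 positions, d=32 dimensions (illustrative values only).
rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16, 32))

mu, pos, ctx, resid = decompose(h)
recon = mu + pos[None, :, :] + ctx[:, None, :] + resid
coherence = max_abs_cosine(pos, ctx)
```

With real hidden states, `h` would be collected by running C input sequences of length T through a model and stacking one layer's outputs; values of `coherence` near zero would correspond to the near-orthogonality described in finding (3).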

Related papers

Attention with Trained Embeddings Provably Selects Important Tokens [73.77633297039097]
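Token embeddings play a crucial role in language modeling, but despite this practical relevance, their theoretical understanding is limited. This paper closes that gap by characterizing the structure of embeddings obtained via gradient descent. Experiments on real-world datasets (IMDB, Yelp) show phenomena close to those revealed by our theory.
Reference translation (metadata) (2025-05-22T21:00:09Z)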
Efficient $1$-bit tensor approximations [1.104960878651584]
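Our algorithm yields efficient signed cut decompositions in $20$ lines of pseudocode. We approximate the weight matrices of the open Mistral-7B-v0.1 large language model to a 50% spatial compression.
Reference translation (metadata) (2024-10-02T17:56:32Z)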
Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
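We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \textstyle\sigma_*\left(\langle\boldsymbol{x},\boldsymbol{\theta}\rangle\right)$. A two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with a complexity that is not governed by the information exponent.
Reference translation (metadata) (2024-06-03T17:56:58Z)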
Transformer In-Context Learning for Categorical Data [51.23121284812406]
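We extend research that seeks to understand Transformers through the lens of in-context learning with functional data by considering categorical outcomes, nonlinear underlying models, and nonlinear attention. Using the ImageNet dataset, we present what is believed to be the first real-world demonstration of this few-shot learning methodology.
Reference translation (metadata) (2024-05-27T15:03:21Z)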
Provably learning a multi-head attention layer [55.2904547651831]
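The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models. In this work, we initiate the study of provably learning a multi-head attention layer from random examples. We show that, in the worst case, an exponential dependence on $m$ is unavoidable.
Reference translation (metadata) (2024-02-06T15:39:09Z)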
Families of costs with zero and nonnegative MTW tensor in optimal transport [0.0]
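We explicitly compute the MTW tensor for the optimal transport problem on $\mathbb{R}^n$ with a cost function $\mathsf{c}$. We analyze a $\sinh$-type hyperbolic cost and provide examples of $\mathsf{c}$-type functions and divergences.
Reference translation (metadata) (2024-01-01T20:33:27Z)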
Increasing subsequences, matrix loci, and Viennot shadows [0.0]
We show that the quotient $\mathbb{F}[\mathbf{x}_{n \times n}]/I_n$ has a standard monomial basis. We also compute $\mathbb{F}[\mathbf{x}_{n \times n}]/I_n$ as a graded $\mathfrak{S}_n \times \mathfrak{S}_n$-module.
Reference translation (metadata) (2023-06-14T19:48:01Z)
Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [50.659479930171585]
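We study functions of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations. The goal of the learner is to output, with high probability, a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C\,\epsilon$.
Reference translation (metadata) (2022-06-17T17:55:43Z)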
Random matrices in service of ML footprint: ternary random features with no performance loss [55.30329197651178]
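We show that the eigenspectrum of ${\bf K}$ is independent of the distribution of the i.i.d. entries of ${\bf w}$. We propose a novel random features technique called Ternary Random Features (TRF). Computing the proposed random features requires no multiplication and a factor of $b$ fewer bits of storage than classical random features.
Reference translation (metadata) (2021-10-05T09:33:49Z)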
Self-training Converts Weak Learners to Strong Learners in Mixture Models [86.7137362125503]
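We show that a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$. Furthermore, we show that by running gradient descent on the logistic loss, one can obtain a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ with classification error $C_{\mathrm{err}}$ using only labeled examples.
Reference translation (metadata) (2021-06-25T17:59:16Z)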
Optimal Spectral Recovery of a Planted Vector in a Subspace [80.02218763267992]
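We study efficient estimation and detection of a planted vector $v$ whose $\ell_4$ norm differs from that of a Gaussian vector with the same $\ell_2$ norm. In the regime $n \rho \gg \sqrt{N}$, a large class of spectral methods (and, more generally, low-degree polynomials of the input) fail to detect the planted vector.
Reference translation (metadata) (2021-05-31T16:10:49Z)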
Learners' languages [0.0]
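The authors showed that gradient descent and backpropagation, the fundamental elements of deep learning, can be conceptualized as a strong monoidal functor. We show that a map $A \to B$ in $\mathbf{Para}(\mathbf{SLens})$ has a natural interpretation in terms of dynamical systems.
Reference translation (metadata) (2021-03-01T18:34:00Z)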
Learning a Lie Algebra from Unlabeled Data Pairs [7.329382191592538]
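Deep convolutional networks (convnets) show a remarkable ability to learn disentangled representations. This paper proposes a machine learning method for discovering nonlinear transformations of the space $\mathbb{R}^n$. The key idea is to approximate every target $\boldsymbol{y}_i$ by a matrix-vector product of the form $\boldsymbol{\widetilde{y}}_i = \boldsymbol{\phi}(t_i) \boldsymbol{x}_i$.
Reference translation (metadata) (2020-09-19T23:23:52Z)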
A Canonical Transform for Strengthening the Local $L^p$-Type Universal Approximation Property [4.18804572788063]
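A given machine learning model class $\mathscr{F}\subseteq C(\mathbb{R}^d,\mathbb{R}^D)$ is guaranteed to be dense in $L^p_{\mu}(\mathbb{R}^d,\mathbb{R}^D)$. This paper proposes a generic solution to this approximation-theoretic problem by introducing a canonical transformation that strengthens "$\mathscr{F}$'s approximation property".
Reference translation (metadata) (2020-06-24T17:46:35Z)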
The Average-Case Time Complexity of Certifying the Restricted Isometry Property [66.65353643599899]
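In compressed sensing, the restricted isometry property (RIP) on $M \times N$ sensing matrices guarantees efficient reconstruction of sparse vectors. The matrices considered are $M \times N$ with i.i.d. $\mathcal{N}(0,1/M)$ entries.
Reference translation (metadata) (2020-05-22T16:55:01Z)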

The related papers list is generated automatically from the titles and abstracts of papers on this site.
