Fugu-MT Paper Translation (Abstract): Manifold-Preserving Transformers are Effective for Short-Long Range Encoding

Paper abstract: Manifold-Preserving Transformers are Effective for Short-Long Range Encoding

arxiv url: http://arxiv.org/abs/2310.14206v1
Date: Sun, 22 Oct 2023 06:58:28 GMT
Status: Translation complete
Last updated in system: 2023-10-25 01:13:13.431576
Title: Manifold-Preserving Transformers are Effective for Short-Long Range Encoding
Authors: Ayan Sengupta, Md Shad Akhtar and Tanmoy Chakraborty
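Abstract summary: Multi-head self-attention-based Transformers have shown promise in different learning tasks. In this work, we propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens.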
Reference score (independently computed attention score): 39.14128923434994
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-head self-attention-based Transformers have shown promise in different learning tasks. Albeit these models exhibit significant improvement in understanding short-term and long-term contexts from sequences, encoders of Transformers and their variants fail to preserve layer-wise contextual information. Transformers usually project tokens onto sparse manifolds and fail to preserve mathematical equivalence among the token representations. In this work, we propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens. We propose a simple alternative to dot-product attention to ensure Lipschitz continuity. This allows TransJect to learn injective mappings to transform token representations to different manifolds with similar topology and preserve Euclidean distance between every pair of tokens in subsequent layers. Evaluations across multiple benchmark short- and long-sequence classification tasks show maximum improvements of 6.8% and 5.9%, respectively, over the variants of Transformers. Additionally, TransJect displays 79% better performance than Transformer on the language modeling task. We further highlight the shortcomings of multi-head self-attention from the statistical physics viewpoint. Although multi-head self-attention was incepted to learn different abstraction levels within the networks, our empirical analyses suggest that different attention heads learn randomly and unorderly. In contrast, TransJect adapts a mixture of experts for regularization; these experts are more orderly and balanced and learn different sparse representations from the input sequences. TransJect exhibits very low entropy and can be efficiently scaled to larger depths.
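The distance-preservation property described in the abstract can be illustrated with a small sketch. This is not TransJect's actual layer, which the abstract does not specify; it only assumes a hypothetical "layer" that is an orthogonal map (here a single plane rotation), which is injective and 1-Lipschitz, and checks that Euclidean distances between every pair of tokens are unchanged. The token values, the helper names, and the rotation angle are all illustrative assumptions.

```python
import math
from itertools import combinations


def givens_rotate(vec, i, j, theta):
    """Rotate `vec` by angle `theta` in the (i, j) coordinate plane.

    A plane rotation is an orthogonal map, so it preserves Euclidean norms.
    """
    out = list(vec)
    c, s = math.cos(theta), math.sin(theta)
    out[i] = c * vec[i] - s * vec[j]
    out[j] = s * vec[i] + c * vec[j]
    return out


def pairwise_distances(tokens):
    """Euclidean distance between every pair of token vectors."""
    return [math.dist(a, b) for a, b in combinations(tokens, 2)]


# Hypothetical 3-dimensional token representations (illustrative values only).
tokens = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [3.0, 2.0, 1.0]]

# Hypothetical "layer": the same orthogonal map applied to every token.
layer_out = [givens_rotate(t, 0, 2, 0.7) for t in tokens]

before = pairwise_distances(tokens)
after = pairwise_distances(layer_out)

# Largest change in any pairwise distance; only floating-point noise remains.
max_gap = max(abs(x - y) for x, y in zip(before, after))
print(max_gap < 1e-9)
```

An orthogonal map is the simplest case of an isometry; the bound claimed in the abstract concerns the proposed model's learned mappings, which this sketch does not attempt to reproduce.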


Related papers