Fugu-MT paper translation (abstract): The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry

Paper summary: The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry

arxiv url: http://arxiv.org/abs/2402.04347v1
Date: Tue, 6 Feb 2024 19:31:26 GMT
Status: translation complete
Last updated in system: 2024-02-08 18:06:14.082012
Title: The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry
Title (reference translation): The Hedgehog & the Porcupine: Expressive Linear Attention via Softmax Mimicry
Authors: Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré
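Abstract summary: Linear attentions have shown potential for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length. The paper proposes Hedgehog, a learnable linear attention that retains the spiky and monotonic properties of softmax attention while maintaining linear complexity.
Reference score (independently computed interest score): 24.198536617002667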
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Linear attentions have shown potential for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length. This holds exciting promise for (1) training linear Transformers from scratch, (2) "finetuned-conversion" of task-specific Transformers into linear versions that recover task performance, and (3) "pretrained-conversion" of Transformers such as large language models into linear versions finetunable on downstream tasks. However, linear attentions often underperform standard softmax attention in quality. To close this performance gap, we find prior linear attentions lack key properties of softmax attention tied to good performance: low-entropy (or "spiky") weights and dot-product monotonicity. We further observe surprisingly simple feature maps that retain these properties and match softmax performance, but are inefficient to compute in linear attention. We thus propose Hedgehog, a learnable linear attention that retains the spiky and monotonic properties of softmax attention while maintaining linear complexity. Hedgehog uses simple trainable MLPs to produce attention weights mimicking softmax attention. Experiments show Hedgehog recovers over 99% of standard Transformer quality in train-from-scratch and finetuned-conversion settings, outperforming prior linear attentions up to 6 perplexity points on WikiText-103 with causal GPTs, and up to 8.7 GLUE score points on finetuned bidirectional BERTs. Hedgehog also enables pretrained-conversion. Converting a pretrained GPT-2 into a linear attention variant achieves state-of-the-art 16.7 perplexity on WikiText-103 for 125M subquadratic decoder models. We finally turn a pretrained Llama-2 7B into a viable linear attention Llama. With low-rank adaptation, Hedgehog-Llama2 7B achieves 28.1 higher ROUGE-1 points over the base standard attention model, where prior linear attentions lead to 16.5 point drops.
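Abstract (reference translation): Linear attentions have shown potential for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length. This holds exciting promise for (1) training linear Transformers from scratch, (2) converting task-specific Transformers into linear versions that recover task performance, and (3) converting pretrained Transformers such as large language models into linear versions that can be finetuned on downstream tasks. However, linear attentions often underperform standard softmax attention in quality. To close this performance gap, we find that prior linear attentions lack key properties of softmax attention tied to good performance: low-entropy (or "spiky") weights and dot-product monotonicity. We further observe surprisingly simple feature maps that retain these properties and match softmax performance, but are inefficient to compute in linear attention. We therefore propose Hedgehog, a learnable linear attention that retains the spiky and monotonic properties of softmax attention while maintaining linear complexity. Hedgehog uses simple trainable MLPs to produce attention weights that mimic softmax attention. Experiments show that Hedgehog recovers over 99% of standard Transformer quality in train-from-scratch settings, outperforming prior linear attentions by up to 6 perplexity points on WikiText-103 and by up to 8.7 GLUE score points on finetuned bidirectional BERTs. Hedgehog also enables pretrained-conversion. Converting a pretrained GPT-2 into a linear attention variant achieves state-of-the-art 16.7 perplexity on WikiText-103 for 125M subquadratic decoder models. We also turn a pretrained Llama-2 7B into a linear attention Llama. With low-rank adaptation, Hedgehog-Llama2 7B achieves 28.1 higher ROUGE-1 points than the base standard attention model.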
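Illustrative note: the abstract's claim that a feature map reduces attention from quadratic to linear cost in sequence length can be sketched as below. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the attention is non-causal, and `feature_map` is a fixed elu(x) + 1 map chosen only because it is positive, whereas Hedgehog learns its feature maps with trainable MLPs.

```python
import numpy as np

def feature_map(x):
    # Assumption: a fixed positive map, elu(x) + 1, used only for illustration.
    # It is NOT the learned Hedgehog feature map.
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_form(q, k, v):
    # Builds the full (n, n) attention weight matrix: cost grows as n^2.
    scores = feature_map(q) @ feature_map(k).T
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v

def linear_form(q, k, v):
    # Same output, reassociated so no (n, n) matrix is formed: cost grows as n.
    fq, fk = feature_map(q), feature_map(k)
    kv = fk.T @ v          # (d, d_v) summary of all keys and values
    z = fk.sum(axis=0)     # (d,) normalizer shared by every query
    return (fq @ kv) / (fq @ z)[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out_quadratic = quadratic_form(q, k, v)
out_linear = linear_form(q, k, v)
print(np.allclose(out_quadratic, out_linear))
```

The two functions agree because matrix multiplication is associative; the linear form simply computes the key-value summary once and reuses it for every query.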


Related papers