Fugu-MT Paper Translation (Abstract): LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling

Paper summary: LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling

arxiv url: http://arxiv.org/abs/2509.18467v1
Date: Mon, 22 Sep 2025 22:43:44 GMT
Status: translation complete
Last updated in system: 2025-09-24 20:41:27.608619
Title: LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling
Title (reference translation): LAWCAT: Efficient distillation from quadratic to linear attention with convolution across tokens for long-context modeling
Authors: Zeyu Liu, Souvik Kundu, Lianghao Jiang, Anni Li, Srikanth Ronanki, Sravan Bodapati, Gourav Datta, Peter A. Beerel
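Abstract summary: This paper proposes a novel linearization framework that efficiently transfers the capabilities of pre-trained transformers to a linear attention architecture. LAWCAT integrates causal Conv1D layers to enhance local dependency modeling and uses normalized gated linear attention to improve generalization across varying context lengths. Evaluations show that distilling Mistral-7B with only 1K-length sequences yields over 90% passkey retrieval accuracy up to 22K tokens, significantly extending its effective context window.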
Reference score (independently computed notability): 27.045621004239067
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although transformer architectures have achieved state-of-the-art performance across diverse domains, their quadratic computational complexity with respect to sequence length remains a significant bottleneck, particularly for latency-sensitive long-context applications. While recent linear-complexity alternatives are increasingly powerful, effectively training them from scratch is still resource-intensive. To overcome these limitations, we propose LAWCAT (Linear Attention with Convolution Across Time), a novel linearization framework designed to efficiently transfer the capabilities of pre-trained transformers into a performant linear attention architecture. LAWCAT integrates causal Conv1D layers to enhance local dependency modeling and employs normalized gated linear attention to improve generalization across varying context lengths. Our comprehensive evaluations demonstrate that distilling Mistral-7B with only 1K-length sequences yields over 90% passkey retrieval accuracy up to 22K tokens, significantly extending its effective context window. Similarly, Llama3.2-1B LAWCAT variant achieves competitive performance on S-NIAH 1&2&3 tasks (1K-8K context length) and BABILong benchmark (QA2&QA3, 0K-16K context length), requiring less than 0.1% pre-training tokens compared with pre-training models. Furthermore, LAWCAT exhibits faster prefill speeds than FlashAttention-2 for sequences exceeding 8K tokens. LAWCAT thus provides an efficient pathway to high-performance, long-context linear models suitable for edge deployment, reducing reliance on extensive long-sequence training data and computational resources.
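Abstract (reference translation): Although transformer architectures have achieved state-of-the-art performance across diverse domains, their quadratic computational complexity with respect to sequence length remains a major bottleneck, particularly for latency-sensitive long-context applications. Recent linear-complexity alternatives are increasingly powerful, but training them effectively from scratch is still resource-intensive. To overcome these limitations, we propose LAWCAT (Linear Attention with Convolution Across Time). LAWCAT integrates causal Conv1D layers to enhance local dependency modeling and uses normalized gated linear attention to improve generalization across context lengths. Comprehensive evaluations show that distilling Mistral-7B with only 1K-length sequences yields over 90% passkey retrieval accuracy up to 22K tokens. Similarly, the Llama3.2-1B LAWCAT variant achieves competitive performance on the S-NIAH 1&2&3 tasks (1K-8K context length) and the BABILong benchmark (QA2&QA3, 0K-16K context length) while requiring less than 0.1% of the pre-training tokens used by pre-trained models. Furthermore, LAWCAT shows faster prefill speeds than FlashAttention-2 for sequences exceeding 8K tokens. LAWCAT thus provides an efficient pathway to high-performance, long-context linear models suitable for edge deployment, reducing reliance on extensive long-sequence training data and computational resources.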
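The abstract names two building blocks: a causal Conv1D across tokens and normalized gated linear attention. The sketch below illustrates the general idea of each in plain Python. It is not the authors' implementation: the function names, the scalar per-token gate, the `elu(x) + 1` feature map, and the exact normalizer are all assumptions chosen for illustration, and the paper's formulation may differ.

```python
import math

def causal_conv1d(xs, kernel):
    """Depthwise causal convolution over the token axis (hypothetical helper).

    xs: list of T token vectors; kernel: list of K taps shared by all channels.
    Position t mixes only tokens t, t-1, ..., t-K+1 (implicit zero left-padding).
    """
    dim = len(xs[0])
    out = []
    for t in range(len(xs)):
        y = [0.0] * dim
        for k, w in enumerate(kernel):
            if t - k >= 0:  # never read future tokens
                for d in range(dim):
                    y[d] += w * xs[t - k][d]
        out.append(y)
    return out

def feature_map(x):
    """elu(x) + 1: a positive feature map, assumed here for illustration."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def normalized_gated_linear_attention(qs, ks, vs, gates, eps=1e-6):
    """Recurrent linear attention with a decay gate and a running normalizer.

    State S accumulates k v^T and z accumulates k; both decay by gate g in (0, 1).
    Cost per token is O(dk * dv), independent of how many tokens came before.
    """
    dk, dv = len(ks[0]), len(vs[0])
    S = [[0.0] * dv for _ in range(dk)]
    z = [0.0] * dk
    outputs = []
    for q, k, v, g in zip(qs, ks, vs, gates):
        q, k = feature_map(q), feature_map(k)
        for i in range(dk):
            z[i] = g * z[i] + k[i]
            for j in range(dv):
                S[i][j] = g * S[i][j] + k[i] * v[j]
        denom = sum(q[i] * z[i] for i in range(dk)) + eps
        outputs.append(
            [sum(q[i] * S[i][j] for i in range(dk)) / denom for j in range(dv)]
        )
    return outputs
```

In this sketch, dividing by the running normalizer makes each output a weighted average of past values, so its scale does not grow with sequence length. That is one plausible reading of why normalization could help generalization beyond the training length.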
