Fugu-MT Paper Translation (Abstract): On The Application of Linear Attention in Multimodal Transformers

Paper overview: On The Application of Linear Attention in Multimodal Transformers

arxiv url: http://arxiv.org/abs/2604.10064v1
Date: Sat, 11 Apr 2026 07:06:33 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:15.820666
Title: On The Application of Linear Attention in Multimodal Transformers
Title (reference translation): On the Application of Linear Attention in Multimodal Transformers
Authors: Armin Gerami, Seyedehanita Madani, Ramani Duraiswami
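Abstract summary: Multimodal Transformers serve as the backbone of state-of-the-art vision-language models, but their quadratic attention complexity is a critical barrier to scalability. We investigate the viability of Linear Attention (LA) as a high-efficiency alternative within multimodal frameworks. Our systematic evaluation demonstrates that Linear Attention not only yields significant computational savings but also adheres to the same scaling laws as standard softmax attention.
Reference score (interest level, computed independently by this site): 9.10734114158633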
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Transformers serve as the backbone for state-of-the-art vision-language models, yet their quadratic attention complexity remains a critical barrier to scalability. In this work, we investigate the viability of Linear Attention (LA) as a high-efficiency alternative within multimodal frameworks. By integrating LA, we reduce the computational overhead from quadratic to linear relative to sequence length while preserving competitive performance. We evaluate our approach across ViT-S/16, ViT-B/16, and ViT-L/16 architectures trained on the LAION-400M dataset, with validation focused on ImageNet-21K zero-shot accuracy. Our systematic evaluation demonstrates that Linear Attention not only yields significant computational savings but also adheres to the same scaling laws as standard softmax attention. These findings position Linear Attention as a robust, scalable solution for next-generation multimodal Transformers tasked with processing increasingly large and complex datasets.
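Abstract (reference translation): Multimodal Transformers serve as the backbone of state-of-the-art vision-language models, but their quadratic attention complexity is a significant barrier to scalability. This study examines the viability of Linear Attention (LA) as a high-efficiency alternative within multimodal frameworks. Integrating LA reduces the computational overhead from quadratic to linear in sequence length while maintaining competitive performance. The approach is evaluated across ViT-S/16, ViT-B/16, and ViT-L/16 architectures trained on the LAION-400M dataset, with validation focused on ImageNet-21K zero-shot accuracy. The systematic evaluation shows that Linear Attention not only reduces computation substantially but also follows the same scaling laws as standard softmax attention. These results position Linear Attention as a robust, scalable solution for next-generation multimodal Transformers that process increasingly large and complex datasets.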
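
The abstract does not say which Linear Attention formulation the authors use, so the following is only a minimal sketch of the general idea behind the "quadratic to linear" claim. It assumes a non-causal, single-head kernelized attention with the feature map elu(x) + 1, a common choice in the linear attention literature and not necessarily the paper's. All function names are hypothetical. Because the key-value summary `S` and the normalizer `z` are computed once and reused for every query, the cost grows linearly with the number of tokens; the reference function builds the full n-by-n weight matrix and should give the same result.

```python
import math


def feature_map(x):
    # elu(x) + 1, a positive feature map (an assumption for this sketch).
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def linear_attention(Q, K, V):
    """Kernelized attention in O(n * d * dv) time, with no n x n matrix."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # sum_j phi(k_j) v_j^T
    z = [0.0] * d                       # sum_j phi(k_j)
    for k, v in zip(K, V):
        fk = feature_map(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = feature_map(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append(
            [sum(fq[a] * S[a][b] for a in range(d)) / denom for b in range(dv)]
        )
    return out


def quadratic_reference(Q, K, V):
    """Same kernel, but materializes all n x n pairwise weights."""
    dv = len(V[0])
    out = []
    for q in Q:
        fq = feature_map(q)
        w = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in K]
        total = sum(w)
        out.append(
            [sum(wj * v[b] for wj, v in zip(w, V)) / total for b in range(dv)]
        )
    return out
```

Both functions take lists of equal-length float vectors (one row per token). The linear version relies only on the associativity of matrix products, computing phi(Q) (phi(K)^T V) rather than (phi(Q) phi(K)^T) V, which is why no softmax appears in it.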

Related papers