Fugu-MT Paper Translation (Abstract): ViT-AdaLA: Adapting Vision Transformers with Linear Attention

Paper summary: ViT-AdaLA: Adapting Vision Transformers with Linear Attention

arxiv url: http://arxiv.org/abs/2603.16063v1
Date: Tue, 17 Mar 2026 02:15:48 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.070688
Title: ViT-AdaLA: Adapting Vision Transformers with Linear Attention
Title (reference translation): ViT-AdaLA: Adaptation of Vision Transformers via Linear Attention
Authors: Yifan Li, Seunghyun Yoon, Viet Dac Lai, Franck Dernoncourt, Jason Kuen, Yu Kong, Trung Bui
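Abstract summary: Vision Transformer (ViT)-based vision foundation models (VFMs) have achieved remarkable performance across a variety of vision tasks. Existing linear attention approaches for ViTs are typically trained from scratch and require substantial computational resources. This paper proposes ViT-AdaLA, a new framework that effectively adapts and transfers prior knowledge from VFMs to linear attention ViTs.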
Reference score (independently computed attention level): 71.36851471416034
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Vision Transformer (ViT)-based vision foundation models (VFMs) have achieved remarkable performance across diverse vision tasks, but suffer from quadratic complexity that limits scalability to long sequences. Existing linear attention approaches for ViTs are typically trained from scratch, requiring substantial computational resources, while linearization-based methods developed for large language model decoders do not transfer well to ViTs. To address these challenges, we propose ViT-AdaLA, a novel framework for effectively adapting and transferring prior knowledge from VFMs to linear attention ViTs. ViT-AdaLA consists of three stages: attention alignment, feature alignment, and supervised fine-tuning. In the attention alignment stage, we align vanilla linear attention with the original softmax-based attention in each block to approximate the behavior of softmax attention. However, residual approximation errors inevitably accumulate across layers. We mitigate this by fine-tuning the linearized ViT to align its final-layer features with a frozen softmax VFM teacher. Finally, the adapted prior knowledge is transferred to downstream tasks through supervised fine-tuning. Extensive experiments on classification and segmentation tasks demonstrate the effectiveness and generality of ViT-AdaLA over various state-of-the-art linear attention counterparts.
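The contrast between quadratic softmax attention and linear attention mentioned in the abstract can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the feature map `elu(x) + 1`, the helper names, and the list-of-rows matrix layout are all assumptions chosen for illustration. The point it shows is that a kernelized attention can be computed as phi(Q)(phi(K)^T V) without ever building the n x n score matrix.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def softmax_attention(q, k, v):
    """Standard attention: forms an n x n score matrix, so cost grows as O(n^2 d)."""
    d = len(q[0])
    scores = matmul(q, transpose(k))  # n x n
    out = []
    for row in scores:
        scaled = [s / math.sqrt(d) for s in row]
        m = max(scaled)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scaled]
        z = sum(w)
        out.append([sum(wi * vj[c] for wi, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out

def feature_map(x):
    """Assumed kernel feature map phi(x) = elu(x) + 1 (always positive)."""
    return [[xi + 1.0 if xi > 0 else math.exp(xi) for xi in row] for row in x]

def linear_attention(q, k, v):
    """Kernelized attention computed as phi(Q) (phi(K)^T V), linear in n."""
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)           # d x d_v summary of keys and values
    k_sum = [sum(col) for col in zip(*fk)]  # length-d normalizer
    out = []
    for row in fq:
        denom = sum(a * b for a, b in zip(row, k_sum))
        numer = matmul([row], kv)[0]
        out.append([x / denom for x in numer])
    return out

def quadratic_reference(q, k, v):
    """The same kernel attention computed the quadratic way, for comparison."""
    fq, fk = feature_map(q), feature_map(k)
    scores = matmul(fq, transpose(fk))      # n x n
    return [[sum(s * vj[c] for s, vj in zip(row, v)) / sum(row)
             for c in range(len(v[0]))] for row in scores]
```

Both `linear_attention` and `quadratic_reference` return the same values up to floating-point error; only the order of the matrix products differs. `softmax_attention` generally returns different values, which is the gap an alignment stage would try to close.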
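Abstract (reference translation): Vision Transformer (ViT)-based vision foundation models (VFMs) achieve excellent performance across a wide range of vision tasks, but they suffer from quadratic complexity that limits their scalability to long sequences. Existing linear attention approaches for ViTs are usually trained from scratch and require substantial computational resources, while linearization-based methods developed for large language model decoders do not transfer well to ViTs. To address these challenges, we propose ViT-AdaLA, a new framework that effectively adapts and transfers prior knowledge from VFMs to linear attention ViTs. ViT-AdaLA consists of three stages: attention alignment, feature alignment, and supervised fine-tuning. In the attention alignment stage, vanilla linear attention in each block is aligned with the original softmax-based attention so as to approximate the behavior of softmax attention. However, residual approximation errors inevitably accumulate across layers. We mitigate this by fine-tuning the linearized ViT so that its final-layer features align with those of a frozen softmax VFM teacher. Finally, the adapted prior knowledge is transferred to downstream tasks through supervised fine-tuning. Extensive experiments on classification and segmentation tasks demonstrate the effectiveness and generality of ViT-AdaLA compared with various state-of-the-art linear attention counterparts.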
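The first two stages described in the abstract can be read as two alignment objectives. The sketch below is only one plausible reading: the abstract does not state which loss is used, so the choice of mean squared error, the function names, and the list-of-rows feature layout are all assumptions made for illustration.

```python
def mse(a, b):
    """Mean squared error between two equally shaped matrices (lists of rows)."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def attention_alignment_loss(student_block_outputs, teacher_block_outputs):
    """Stage 1 (assumed form): average per-block MSE between the linear
    attention output and the original softmax attention output."""
    losses = [mse(s, t) for s, t in zip(student_block_outputs, teacher_block_outputs)]
    return sum(losses) / len(losses)

def feature_alignment_loss(student_final, teacher_final):
    """Stage 2 (assumed form): MSE between the linearized model's final-layer
    features and those of the frozen softmax teacher."""
    return mse(student_final, teacher_final)

# Toy values: two blocks, one token, two feature dimensions each.
teacher_blocks = [[[1.0, 2.0]], [[0.0, 1.0]]]
student_blocks = [[[1.0, 2.0]], [[0.0, 3.0]]]
stage1 = attention_alignment_loss(student_blocks, teacher_blocks)  # 1.0
stage2 = feature_alignment_loss([[1.0, 1.0]], [[1.0, 3.0]])        # 2.0
```

In this reading, stage 1 supervises every block locally, while stage 2 supervises only the final output, which is where errors accumulated across layers would show up. The third stage, supervised fine-tuning, would use the ordinary task loss of the downstream task and is not sketched here.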

Related papers