Fugu-MT Paper Translation (Abstract): Attention Transfer Is Not Universally Effective for Vision Transformers

Paper overview: Attention Transfer Is Not Universally Effective for Vision Transformers

arxiv url: http://arxiv.org/abs/2605.07191v1
Date: Fri, 08 May 2026 03:39:38 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:38.77588
Title: Attention Transfer Is Not Universally Effective for Vision Transformers
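Title (reference translation): Attention Transfer for Vision Transformers Is Not Universally Effective
Authors: Huaiyuan Qin, Muli Yang, Gabriel James Goenawan, Peng Hu, Chen Gong, Xi Peng, Hongyuan Zhu
Abstract summary: We revisit this finding on a benchmark of 20 teachers from 11 well-known ViT families. While 7 families transfer successfully, 4 consistently fail, falling up to 5.1% below the from-scratch baseline. We identify architectural mismatch between the pre-trained teacher and the standard student as the primary mechanism.
Reference score (independently computed level of interest): 47.26921741602587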
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A recent work shows that Attention Transfer, which transfers only the attention patterns from a pre-trained teacher Vision Transformer (ViT) to a randomly initialized standard student ViT, is sufficient to recover the full benefit of the teacher's pre-trained weights. We revisit this finding on a comprehensive benchmark of 20 teachers from 11 well-known ViT families and reveal that Attention Transfer is not universally effective. While 7 families transfer successfully, 4 consistently fail, falling up to 5.1% below the from-scratch no-transfer baseline. Further results demonstrate that this failure is family-consistent across model sizes, and persists under extended training durations, different transfer datasets, and out-of-distribution evaluations. Controlled analyses then consistently localize the problem to the attention-routing channel, indicating that the key issue is not whether the student can match the teacher's attention patterns, but whether the matched patterns remain functional for the student. Crucially, we identify architectural mismatch between the pre-trained teacher and the standard student as the primary mechanism. By adding only the teacher's native architectural components to the student in a randomly initialized state, we completely reverse the failure for all 4 families. Notably, these components alone do not improve from-scratch training, confirming that they specifically unlock the usability of the teacher's attention. We further systematically show that this failure is not explained by the inadequate choice of transfer loss or by differences in pre-training recipes. Our findings refine the prevailing understanding of attention in ViT representations: attention is sufficient only when the student architecture matches the teacher.
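Abstract (reference translation): Recent work shows that Attention Transfer, which transfers only the attention patterns from a pre-trained teacher Vision Transformer (ViT) to a randomly initialized standard student ViT, is sufficient to recover the full benefit of the teacher's pre-trained weights. We revisit this finding on a comprehensive benchmark of 20 teachers from 11 well-known ViT families and show that Attention Transfer is not universally effective. While 7 families transfer successfully, 4 consistently fail, falling up to 5.1% below the from-scratch no-transfer baseline. Further results show that this failure is consistent within a family across model sizes and persists under extended training, different transfer datasets, and out-of-distribution evaluation. The key issue is not whether the student can match the teacher's attention patterns, but whether the matched patterns remain functional for the student. Crucially, we identify architectural mismatch between the pre-trained teacher and the standard student as the primary mechanism. Adding only the teacher's native architectural components to the student, in a randomly initialized state, completely reverses the failure for all 4 families. Notably, these components alone do not improve from-scratch training, confirming that they specifically unlock the usability of the teacher's attention. We further show systematically that the failure is not explained by an inadequate choice of transfer loss or by differences in pre-training recipes. Attention is sufficient only when the student architecture matches the teacher.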
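
To make the idea of matching attention patterns concrete, the following is a minimal sketch of an attention-matching loss for a single head on toy data. The abstract does not state which transfer loss the authors use, so the KL-divergence objective here is an assumption, and the names `attention_map` and `attention_kl` are hypothetical helpers written for this illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Single-head attention map softmax(Q K^T / sqrt(d)); each row sums to 1."""
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


def attention_kl(teacher_attn, student_attn, eps=1e-12):
    """Mean over query rows of KL(teacher_row || student_row).

    Assumed transfer loss for illustration; not taken from the paper.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_attn, student_attn):
        total += sum(
            t * (math.log(t + eps) - math.log(s + eps))
            for t, s in zip(t_row, s_row)
        )
    return total / len(teacher_attn)


# Toy queries/keys for 3 tokens with head dimension 2 (made-up numbers).
teacher_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
student_q = [[0.2, 0.1], [0.3, -0.4], [0.0, 0.5]]
student_k = [[0.1, 0.9], [-0.5, 0.2], [0.7, 0.0]]

teacher_attn = attention_map(teacher_q, teacher_k)
student_attn = attention_map(student_q, student_k)

# Loss is positive while the student's attention differs from the teacher's,
# and zero once the two maps coincide.
loss_mismatch = attention_kl(teacher_attn, student_attn)
loss_match = attention_kl(teacher_attn, teacher_attn)
```

In an actual training setup the teacher map would be held fixed and only the student's parameters would receive gradients. Note that the abstract's central point is that driving such a loss to zero is not enough by itself: the matched patterns must also remain functional for the student's architecture.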
