Fugu-MT Paper Translation (Abstract): Advancing Vision Transformer with Enhanced Spatial Priors

Paper abstract: Advancing Vision Transformer with Enhanced Spatial Priors

arxiv url: http://arxiv.org/abs/2604.18549v1
Date: Mon, 20 Apr 2026 17:41:00 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:53.025192
Title: Advancing Vision Transformer with Enhanced Spatial Priors
Title (reference translation): Advancing Vision Transformer with Enhanced Spatial Priors
Authors: Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, Ran He
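Abstract summary: The Vision Transformer (ViT) has garnered significant attention within the computer vision community. We have proposed RMT, a robust vision backbone with explicit spatial priors for general purposes. Building on the strengths of RMT, Euclidean enhanced Vision Transformer (EVT) is an expanded version that incorporates several key improvements.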
Reference score (independently computed attention score): 45.601974887796864
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, the Vision Transformer (ViT) has garnered significant attention within the computer vision community. However, the core component of ViT, Self-Attention, lacks explicit spatial priors and suffers from quadratic computational complexity, limiting its applicability. To address these issues, we have proposed RMT, a robust vision backbone with explicit spatial priors for general purposes. RMT utilizes Manhattan distance decay to introduce spatial information and employs a horizontal and vertical decomposition attention method to model global information. Building on the strengths of RMT, Euclidean enhanced Vision Transformer (EVT) is an expanded version that incorporates several key improvements. First, EVT uses a more reasonable Euclidean distance decay to enhance the modeling of spatial information, allowing for a more accurate representation of spatial relationships compared to the Manhattan distance used in RMT. Second, EVT abandons the decomposed attention mechanism featured in RMT and instead adopts a simpler spatially-independent grouping approach, providing the model with greater flexibility in controlling the number of tokens within each group. With these modifications, EVT offers a more sophisticated and adaptable approach to incorporating spatial priors into the Self-Attention mechanism, thus overcoming some of the limitations associated with RMT and further enhancing its applicability in various computer vision tasks. Extensive experiments on Image Classification, Object Detection, Instance Segmentation, and Semantic Segmentation demonstrate that EVT exhibits exceptional performance. Without additional training data, EVT achieves 86.6% top-1 accuracy on ImageNet-1k.
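Abstract (reference translation): In recent years, the Vision Transformer (ViT) has attracted significant attention in the computer vision community. However, Self-Attention, the core component of ViT, has no explicit spatial priors and suffers from quadratic computational complexity, which limits its applicability. To address these problems, we proposed RMT, a robust general-purpose vision backbone with explicit spatial priors. RMT uses Manhattan distance decay to introduce spatial information and models global information with an attention method decomposed into horizontal and vertical directions. Building on the strengths of RMT, Euclidean enhanced Vision Transformer (EVT) is an expanded version that incorporates several key improvements. First, EVT uses a more reasonable Euclidean distance decay to strengthen the modeling of spatial information, enabling a more accurate representation of spatial relationships than the Manhattan distance used in RMT. Second, EVT abandons the decomposed attention mechanism featured in RMT and instead adopts a simpler spatially-independent grouping approach, giving the model more flexibility in controlling the number of tokens within each group. With these modifications, EVT offers a more sophisticated and adaptable way to incorporate spatial priors into the Self-Attention mechanism, overcoming some of the limitations associated with RMT and extending its applicability to various computer vision tasks. Extensive experiments on image classification, object detection, instance segmentation, and semantic segmentation show that EVT exhibits exceptional performance. Without additional training data, EVT achieves 86.6% top-1 accuracy on ImageNet-1k.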
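
The abstract contrasts the Manhattan distance decay of RMT with the Euclidean distance decay of EVT but does not state a formula. The following is a minimal sketch of the difference, assuming the decay weight between two tokens on an H x W grid has the form gamma ** d, where d is the distance between their grid positions. The function name `decay_mask`, the parameter `gamma`, and the exponential form itself are illustrative assumptions, not details taken from the abstract.

```python
import math


def decay_mask(height, width, gamma=0.9, metric="euclidean"):
    """Build an (H*W) x (H*W) matrix of gamma ** distance between grid tokens.

    Hypothetical helper for illustration only: the abstract names the two
    distance metrics but does not specify how the decay is parameterized.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if metric not in ("manhattan", "euclidean"):
        raise ValueError("metric must be 'manhattan' or 'euclidean'")

    # Flatten the grid in row-major order: token index = row * width + col.
    coords = [(r, c) for r in range(height) for c in range(width)]

    mask = []
    for r1, c1 in coords:
        row = []
        for r2, c2 in coords:
            dr, dc = abs(r1 - r2), abs(c1 - c2)
            if metric == "manhattan":
                dist = dr + dc              # |dr| + |dc|
            else:
                dist = math.hypot(dr, dc)   # sqrt(dr^2 + dc^2)
            row.append(gamma ** dist)
        mask.append(row)
    return mask


if __name__ == "__main__":
    manhattan = decay_mask(3, 3, gamma=0.9, metric="manhattan")
    euclidean = decay_mask(3, 3, gamma=0.9, metric="euclidean")
    # Token 0 sits at (0, 0) and token 4 at (1, 1), i.e. diagonal neighbours.
    print("manhattan weight:", manhattan[0][4])
    print("euclidean weight:", euclidean[0][4])
```

Under these assumptions the two masks agree for tokens in the same row or column and differ on diagonals: a diagonal neighbour is at Manhattan distance 2 but Euclidean distance sqrt(2), so the Euclidean mask penalizes it less. In an attention layer, a mask of this kind would typically be multiplied elementwise with the attention weights; how EVT applies it is not described in the abstract.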
