Fugu-MT Paper Translation (Abstract): Self-attention in Vision Transformers Performs Perceptual Grouping, Not Attention

Paper summary: Self-attention in Vision Transformers Performs Perceptual Grouping, Not Attention

arxiv url: http://arxiv.org/abs/2303.01542v1
Date: Thu, 2 Mar 2023 19:18:11 GMT
Status: Translation complete
Last updated in system: 2023-03-06 17:15:27.493490
Title: Self-attention in Vision Transformers Performs Perceptual Grouping, Not Attention
Title (reference translation): Self-attention in vision transformers performs perceptual grouping, not attention
Authors: Paria Mehrani and John K. Tsotsos
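Abstract summary: The paper asks whether the attention mechanisms in vision transformers exhibit effects similar to those of human visual attention. The results suggest that self-attention modules group figures in the stimuli based on similarity in visual features such as color. In a singleton detection experiment, the authors examined whether these models exhibit effects similar to those of the feed-forward visual salience mechanisms used in human visual attention.
Reference score (independently computed attention score): 11.789983276366986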
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recently, a considerable number of studies in computer vision involves deep neural architectures called vision transformers. Visual processing in these models incorporates computational models that are claimed to implement attention mechanisms. Despite an increasing body of work that attempts to understand the role of attention mechanisms in vision transformers, their effect is largely unknown. Here, we asked if the attention mechanisms in vision transformers exhibit similar effects as those known in human visual attention. To answer this question, we revisited the attention formulation in these models and found that despite the name, computationally, these models perform a special class of relaxation labeling with similarity grouping effects. Additionally, whereas modern experimental findings reveal that human visual attention involves both feed-forward and feedback mechanisms, the purely feed-forward architecture of vision transformers suggests that attention in these models will not have the same effects as those known in humans. To quantify these observations, we evaluated grouping performance in a family of vision transformers. Our results suggest that self-attention modules group figures in the stimuli based on similarity in visual features such as color. Also, in a singleton detection experiment as an instance of saliency detection, we studied if these models exhibit similar effects as those of feed-forward visual salience mechanisms utilized in human visual attention. We found that generally, the transformer-based attention modules assign more salience either to distractors or the ground. Together, our study suggests that the attention mechanisms in vision transformers perform similarity grouping and not attention.
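Abstract (reference translation): In recent years, a considerable number of studies in computer vision have involved deep neural architectures called vision transformers. Visual processing in these models incorporates computational models that are claimed to implement attention mechanisms. Although a growing body of work attempts to understand the role of attention mechanisms in vision transformers, their effect remains largely unknown. Here, we ask whether the attention mechanisms in vision transformers exhibit effects similar to those of human visual attention. To answer this question, we revisited the attention formulation in these models and found that, despite the name, these models computationally perform a special class of relaxation labeling with similarity grouping effects. Furthermore, whereas modern experimental findings show that human visual attention involves both feed-forward and feedback mechanisms, the purely feed-forward architecture of vision transformers suggests that attention in these models will not have the same effects as those known in humans. To quantify these observations, we evaluated grouping performance in a family of vision transformers. The results suggest that self-attention modules group figures in the stimuli based on similarity in visual features such as color. In addition, in a singleton detection experiment, as an instance of saliency detection, we examined whether these models exhibit effects similar to those of the feed-forward visual salience mechanisms used in human visual attention. In general, the transformer-based attention modules assign more salience to either the distractors or the ground. Together, our study suggests that the attention mechanisms in vision transformers perform similarity grouping, not attention.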
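
The abstract's claim that self-attention groups items by feature similarity can be illustrated with a toy example. The sketch below is not the authors' experimental setup or model. It assumes a single attention head, hand-picked RGB "tokens", identity query/key projections, and a hypothetical `gain` parameter. It only shows that softmax over dot-product scores gives same-colored tokens most of each other's attention weight.

```python
import numpy as np

def self_attention_weights(x, w_q, w_k):
    """Return the row-stochastic matrix softmax(Q K^T / sqrt(d))."""
    q = x @ w_q
    k = x @ w_k
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy "stimulus": six tokens, three red and three green (RGB features).
red = np.array([1.0, 0.0, 0.0])
green = np.array([0.0, 1.0, 0.0])
tokens = np.stack([red, red, red, green, green, green])

# Assumption: identity projections, with a hypothetical gain on the queries
# so that similarity differences are visible after the softmax.
gain = 3.0
w_q = gain * np.eye(3)
w_k = np.eye(3)

attn = self_attention_weights(tokens, w_q, w_k)

# Attention mass that the first red token places on each color group.
same_color_mass = attn[0, :3].sum()
other_color_mass = attn[0, 3:].sum()
print("same color:", same_color_mass, "other color:", other_color_mass)
```

In this toy setting the attention matrix is block-structured by color, which matches the similarity-grouping reading of the abstract. Learned projections in a trained vision transformer need not behave this cleanly.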

Related papers