Fugu-MT Paper Translation (Abstract): When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism

arxiv url: http://arxiv.org/abs/2201.10801v1
Date: Wed, 26 Jan 2022 08:17:06 GMT
Status: Translation complete
Last updated in system: 2022-01-27 19:36:58.129591
Title: When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism
Authors: Guangting Wang, Yucheng Zhao, Chuanxin Tang, Chong Luo, Wenjun Zeng
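Abstract summary: The attention mechanism is widely believed to be the key to the success of vision transformers (ViTs). To examine its role, the paper reduces it to an extreme case with ZERO FLOP and ZERO parameters. It builds a new backbone network, ShiftViT, in which the attention layers of ViT are replaced by shift operations.
Reference score (interest level computed by this site): 74.07068010512015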
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The attention mechanism is widely believed to be the key to the success of vision transformers (ViTs), since it provides a flexible and powerful way to model spatial relationships. However, is the attention mechanism truly an indispensable part of ViT? Can it be replaced by some other alternative? To demystify the role of the attention mechanism, we simplify it into an extremely simple case: ZERO FLOP and ZERO parameter. Concretely, we revisit the shift operation. It contains no parameters or arithmetic calculations. The only operation is to exchange a small portion of the channels between neighboring features. Based on this simple operation, we construct a new backbone network, namely ShiftViT, where the attention layers in ViT are substituted by shift operations. Surprisingly, ShiftViT works quite well in several mainstream tasks, e.g., classification, detection, and segmentation. The performance is on par with or even better than the strong baseline Swin Transformer. These results suggest that the attention mechanism might not be the vital factor that makes ViT successful. It can even be replaced by a zero-parameter operation. We should pay more attention to the remaining parts of ViT in future work. Code is available at github.com/microsoft/SPACH.
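
The shift operation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see github.com/microsoft/SPACH for that). The function name `shift_features`, the `fraction` parameter, the split into four directions, the one-pixel step, and the zero filling at the borders are all assumptions made for illustration; the abstract only states that a small portion of the channels is exchanged between neighboring features, with no parameters and no arithmetic.

```python
import numpy as np


def shift_features(x: np.ndarray, fraction: float = 0.25) -> np.ndarray:
    """Shift a fraction of the channels by one pixel in four directions.

    x has shape (C, H, W). The value of `fraction`, the four-direction
    split, and the zero filling are illustrative assumptions, not details
    taken from the abstract.
    """
    channels = x.shape[0]
    g = int(channels * fraction) // 4  # channels assigned to each direction

    out = x.copy()   # channels beyond the first 4*g pass through unchanged
    out[:4 * g] = 0  # vacated border positions in shifted channels stay zero

    # Pure memory movement: no learned weights, no multiplications.
    out[0 * g:1 * g, :, 1:] = x[0 * g:1 * g, :, :-1]  # move content right
    out[1 * g:2 * g, :, :-1] = x[1 * g:2 * g, :, 1:]  # move content left
    out[2 * g:3 * g, 1:, :] = x[2 * g:3 * g, :-1, :]  # move content down
    out[3 * g:4 * g, :-1, :] = x[3 * g:4 * g, 1:, :]  # move content up
    return out


# Toy feature map: 8 channels on a 3x3 grid, half of the channels shifted.
x = np.arange(8 * 3 * 3, dtype=np.float32).reshape(8, 3, 3)
y = shift_features(x, fraction=0.5)
print(y.shape)
```

In a ShiftViT-style block, an operation like this would stand where the attention layer sits in a ViT block, so each position sees a few channels from its immediate neighbors before the remaining parts of the block process it.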

Related papers

CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [59.193626019860226]
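Vision Transformers (ViTs) mark a revolutionary advance in neural networks thanks to the powerful global-context capability of their token mixers. We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers. We show that CAS-ViT achieves competitive performance compared with other state-of-the-art backbones.
Paper reference translation (metadata) (2024-08-07T11:33:46Z)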
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions [4.554319452683839]
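Vision Transformer (ViT) has achieved great success in computer vision, but it does not perform well on dense prediction tasks. We present ViT-CoMer, a ViT backbone with convolutional multi-scale feature interaction. We propose a simple and efficient CNN-Transformer bidirectional fusion module that performs multi-scale fusion across hierarchical features.
Paper reference translation (metadata) (2024-03-12T07:59:41Z)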
A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
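Vision Transformers have recently shown great promise on many vision tasks owing to their insightful architecture design and attention mechanism. We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate that global context into convolutions. With fewer than 14M parameters, FCViT-S12 outperforms ResT-Lite by 3.7% in accuracy on ImageNet-1K.
Paper reference translation (metadata) (2022-12-23T19:13:43Z)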
Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
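Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective. However, customized algorithms such as GreenMIM have to be carefully designed for hierarchical ViTs, instead of using the vanilla and simple MAE as for the plain ViT. This paper proposes a new idea that disentangles hierarchical architecture design from self-supervised pre-training.
Paper reference translation (metadata) (2022-11-03T13:19:23Z)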
Vision Transformers provably learn spatial structure [34.61885883486938]
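Vision Transformers (ViTs) have achieved performance comparable or superior to convolutional neural networks (CNNs) in computer vision. Recent work nonetheless shows that, while minimizing their training loss, ViTs specifically learn spatially localized patterns.
Paper reference translation (metadata) (2022-10-13T19:53:56Z)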
HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
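We propose a new design of hierarchical vision transformer named HiViT (Hierarchical ViT). HiViT enjoys both high efficiency and good performance in MIM. When running MAE on ImageNet-1K, HiViT-B reports a 0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
Paper reference translation (metadata) (2022-05-30T09:34:44Z)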
Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition [0.0]
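We propose a temporal cross-attention mechanism for action recognition based on the structure of the vision transformer (ViT). Simply applying ViT to each frame of a video captures frame features but cannot model temporal features. The proposed model captures temporal information by shifting the query, key, and value in the MSA computation of ViT.
Paper reference translation (metadata) (2022-04-01T14:06:19Z)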
Can Vision Transformers Perform Convolution? [78.42076260340869]
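We show, by construction, that a single ViT layer taking image patches as input can perform any convolution operation. We also provide a lower bound on the number of heads a Vision Transformer needs to express CNNs.
Paper reference translation (metadata) (2021-11-02T03:30:17Z)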

The related papers list is generated automatically from the titles and abstracts of the papers on this site.

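This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).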