Fugu-MT paper translation (abstract): CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Paper overview: CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

arxiv url: http://arxiv.org/abs/2107.00652v1
Date: Thu, 1 Jul 2021 17:59:56 GMT
Status: translation complete
Last updated in system: 2021-07-02 13:55:32.694079
Title: CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
Title (reference translation): CSWin Transformer: A General Vision Transformer Backbone Equipped with Cross-Shaped Windows
Authors: Xiaoyi Dong and Jianmin Bao and Dongdong Chen and Weiming Zhang and Nenghai Yu and Lu Yuan and Dong Chen and Baining Guo
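Abstract summary: We propose CSWin Transformer, an efficient Transformer-based backbone for general-purpose vision tasks. A challenge in Transformer design is that global self-attention is very expensive to compute, whereas local self-attention often limits the field of interactions between tokens.
Reference score (independently computed attention level): 99.36226415086243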
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism for computing self-attention in the horizontal and vertical stripes in parallel that form a cross-shaped window, with each stripe obtained by splitting the input feature into stripes of equal width. We provide a detailed mathematical analysis of the effect of the stripe width and vary the stripe width for different layers of the Transformer network which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles the local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions, and is thus especially effective and friendly for downstream tasks. Incorporated with these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or label, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 51.7 mIOU on the ADE20K semantic segmentation task, surpassing previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under the similar FLOPs setting. By further pretraining on the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K and state-of-the-art segmentation performance on ADE20K with 55.2 mIoU. The code and models will be available at https://github.com/microsoft/CSWin-Transformer.
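Abstract (reference translation): We propose CSWin Transformer, an efficient Transformer-based backbone for general-purpose vision tasks. A challenge in Transformer design is that global self-attention is very expensive to compute, whereas local self-attention often limits each token's field of interaction. We therefore develop a cross-shaped window self-attention mechanism that splits the input feature into stripes of equal width and computes self-attention over horizontal and vertical stripes in parallel, which together form a cross-shaped window. We give a detailed mathematical analysis of the effect of the stripe width and vary the stripe width across the layers of the Transformer network, achieving strong modeling capability while limiting computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions, making it especially effective and convenient for downstream tasks. Combining these designs with a hierarchical structure, CSWin Transformer shows competitive performance on common vision tasks. Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without extra training data or labels, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 51.7 mIoU on the ADE20K semantic segmentation task, surpassing the previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under similar FLOPs. With further pretraining on the larger ImageNet-21K dataset, it reaches 87.5% Top-1 accuracy on ImageNet-1K and state-of-the-art segmentation performance on ADE20K with 55.2 mIoU. The code and models are available at https://github.com/microsoft/CSWin-Transformer.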
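
The cross-shaped window described in the abstract can be pictured with a small sketch. The code below is not taken from the paper or its repository; the function names (`stripe_bounds`, `cross_shaped_window`, `render`) and the 8x8 grid are assumptions chosen for illustration. It only computes which token positions fall inside the horizontal and vertical stripes that contain a given query token, assuming stripes of equal width; it does not compute attention weights or model how heads are divided between the two stripe directions.

```python
def stripe_bounds(index, stripe_width, size):
    """Return the [start, end) range of the equal-width stripe containing `index`."""
    start = (index // stripe_width) * stripe_width
    return start, min(start + stripe_width, size)


def cross_shaped_window(row, col, height, width, stripe_width):
    """Positions covered by the horizontal and vertical stripes through (row, col).

    Illustrative assumption: the feature map is a height x width grid of tokens,
    split into stripes of `stripe_width` rows (horizontal) or columns (vertical).
    """
    r0, r1 = stripe_bounds(row, stripe_width, height)
    c0, c1 = stripe_bounds(col, stripe_width, width)
    # Horizontal stripe: a band of rows spanning the full width.
    horizontal = {(r, c) for r in range(r0, r1) for c in range(width)}
    # Vertical stripe: a band of columns spanning the full height.
    vertical = {(r, c) for r in range(height) for c in range(c0, c1)}
    # The union of the two stripes is the cross-shaped region.
    return horizontal | vertical


def render(window, row, col, height, width):
    """Draw the grid: Q = query token, # = inside the window, . = outside."""
    lines = []
    for r in range(height):
        lines.append("".join(
            "Q" if (r, c) == (row, col) else "#" if (r, c) in window else "."
            for c in range(width)
        ))
    return "\n".join(lines)


if __name__ == "__main__":
    window = cross_shaped_window(3, 5, 8, 8, 2)
    print(render(window, 3, 5, 8, 8))
    print(len(window), "of", 8 * 8, "positions covered")
```

In this toy setting with stripe width 2, the query at row 3, column 5 covers 28 of the 64 positions (2·8 + 2·8 − 2·2), which illustrates the trade-off the abstract describes: a wider stripe enlarges each token's field of interaction at a higher computation cost.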

Related papers