Fugu-MT paper translation (abstract): Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Paper summary: Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

arxiv url: http://arxiv.org/abs/2102.12122v1
Date: Wed, 24 Feb 2021 08:33:55 GMT
Status: translation complete
Last updated in system: 2021-02-25 17:55:26.323370
Title: Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
Title (reference translation): Pyramid Vision Transformer: A General-Purpose Backbone for Dense Prediction without Convolutions
Authors: Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao
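Abstract summary: This work investigates a simple backbone network, built without convolutions, that is useful for many dense prediction tasks. Unlike recently proposed Transformer models designed specifically for image classification (e.g., ViT), the authors propose the Pyramid Vision Transformer (PVT). PVT can be trained on dense partitions of the image to achieve high output resolution, which is important for dense prediction.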
Reference score (independently computed attention score): 103.03973037619532
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although convolutional neural networks (CNNs) have achieved great success as backbones in computer vision, this work investigates a simple backbone network useful for many dense prediction tasks without convolutions. Unlike the recently proposed Transformer model (e.g., ViT) that is specially designed for image classification, we propose the Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several merits compared to prior art. (1) Unlike ViT, which typically has low-resolution outputs and high computational and memory cost, PVT can not only be trained on dense partitions of the image to achieve high output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the computation of large feature maps. (2) PVT inherits the advantages of both CNN and Transformer, making it a unified backbone for various vision tasks without convolutions by simply replacing CNN backbones. (3) We validate PVT by conducting extensive experiments, showing that it boosts the performance of many downstream tasks, e.g., object detection and semantic and instance segmentation. For example, with a comparable number of parameters, RetinaNet+PVT achieves 40.4 AP on the COCO dataset, surpassing RetinaNet+ResNet50 (36.3 AP) by 4.1 absolute AP. We hope PVT can serve as an alternative and useful backbone for pixel-level predictions and facilitate future research. Code is available at https://github.com/whai362/PVT.
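Abstract (reference translation): While convolutional neural networks (CNNs) have achieved great success in computer vision, we explore a simple backbone network, without convolutions, that is useful for many dense prediction tasks. Unlike the recently proposed Transformer models for image classification (e.g., ViT), we propose the Pyramid Vision Transformer (PVT), which overcomes the difficulty of porting Transformers to various dense prediction tasks. PVT has several advantages over prior art. (1) Unlike ViT, which typically has low-resolution outputs and high computational and memory cost, PVT can be trained on dense partitions of the image to achieve the high output resolution that is important for dense prediction, and it also uses a progressive shrinking pyramid to reduce the computation of large feature maps. (2) PVT inherits the advantages of both CNNs and Transformers and becomes a unified backbone for various vision tasks without convolutions simply by replacing CNN backbones. (3) Extensive experiments validate PVT and show that it improves performance on many downstream tasks such as object detection, semantic segmentation, and instance segmentation. For example, with a comparable number of parameters, RetinaNet+PVT achieves 40.4 AP on the COCO dataset, exceeding RetinaNet+ResNet50 (36.3 AP) by 4.1 absolute AP. We hope that PVT will serve as an alternative and useful backbone for pixel-level prediction and facilitate future research. The code is available at https://github.com/whai362/PVT.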
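Note: Point (1) of the abstract contrasts a single low-resolution output with a progressively shrinking pyramid. The sketch below illustrates the idea by counting tokens per stage. The 224x224 input, the stage strides (4, 8, 16, 32), the single-scale stride of 16, and the use of plain full self-attention as the cost model are all assumptions made for illustration; none of them is stated in the abstract, and the helper names are hypothetical.

```python
# Illustrative sketch only. Input size, strides, and the full-attention cost
# model are assumptions, not values taken from the abstract.

def tokens_per_stage(height, width, strides):
    """Return (rows, cols, token_count) of the feature map at each stride."""
    stages = []
    for stride in strides:
        rows, cols = height // stride, width // stride
        stages.append((rows, cols, rows * cols))
    return stages


def attention_pairs(token_count):
    """Query-key pairs in full self-attention over token_count tokens."""
    return token_count * token_count


if __name__ == "__main__":
    height, width = 224, 224           # assumed input resolution
    pyramid_strides = (4, 8, 16, 32)   # assumed shrinking pyramid
    single_scale_strides = (16,)       # assumed single-scale patch size

    for name, strides in (("pyramid", pyramid_strides),
                          ("single-scale", single_scale_strides)):
        for stride, (rows, cols, count) in zip(
                strides, tokens_per_stage(height, width, strides)):
            print(f"{name:12s} stride {stride:2d}: {rows}x{cols} map, "
                  f"{count} tokens, {attention_pairs(count)} attention pairs")
```

Under these assumptions the first pyramid stage keeps a fine 56x56 map for dense prediction, while each later stage has a quarter of the tokens of the one before it, so the quadratic attention cost falls quickly as the map shrinks.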


Related papers