Fugu-MT Paper Translation (Summary): UniFormer: Unifying Convolution and Self-attention for Visual Recognition

Paper summary: UniFormer: Unifying Convolution and Self-attention for Visual Recognition

arxiv url: http://arxiv.org/abs/2201.09450v1
Date: Mon, 24 Jan 2022 04:39:39 GMT
Status: Translation complete
Last updated in system: 2022-01-25 15:46:46.875557
Title: UniFormer: Unifying Convolution and Self-attention for Visual Recognition
Title (reference translation): UniFormer: Unifying Convolution and Self-Attention for Visual Recognition
Authors: Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao
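Abstract summary: Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the dominant frameworks in the past few years. We propose a novel Unified transFormer (UniFormer) that seamlessly integrates the merits of convolution and self-attention in a concise transformer format. Our UniFormer achieves 86.3 top-1 accuracy on ImageNet-1K classification.
Reference score (independently calculated attention score): 69.68907941116127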
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: It is a challenging task to learn discriminative representation from images and videos, due to large local redundancy and complex global dependency in these visual data. Convolutional neural networks (CNNs) and vision transformers (ViTs) have been two dominant frameworks in the past few years. Though CNNs can efficiently decrease local redundancy by convolution within a small neighborhood, the limited receptive field makes it hard to capture global dependency. Alternatively, ViTs can effectively capture long-range dependency via self-attention, while blind similarity comparisons among all the tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which can seamlessly integrate the merits of convolution and self-attention in a concise transformer format. Different from the typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local and global token affinity respectively in shallow and deep layers, allowing it to tackle both redundancy and dependency for efficient and effective representation learning. Finally, we flexibly stack our UniFormer blocks into a new powerful backbone, and adopt it for various vision tasks from image to video domain, from classification to dense prediction. Without any extra training data, our UniFormer achieves 86.3 top-1 accuracy on ImageNet-1K classification. With only ImageNet-1K pre-training, it can simply achieve state-of-the-art performance in a broad range of downstream tasks, e.g., it obtains 82.9/84.8 top-1 accuracy on Kinetics-400/600, 60.9/71.2 top-1 accuracy on Something-Something V1/V2 video classification tasks, 53.8 box AP and 46.4 mask AP on COCO object detection task, 50.8 mIoU on ADE20K semantic segmentation task, and 77.4 AP on COCO pose estimation task. Code is available at https://github.com/Sense-X/UniFormer.
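Abstract (reference translation): Learning discriminative representations from images and videos is a difficult task because of the large local redundancy and complex global dependency in such visual data. Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years. CNNs can efficiently reduce local redundancy through convolution within a small neighborhood, but their limited receptive field makes it hard to capture global dependency. ViTs, in contrast, can effectively capture long-range dependency through self-attention, but blind similarity comparisons among all tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer) that seamlessly integrates the merits of convolution and self-attention in a concise transformer format. Unlike typical transformer blocks, the relation aggregators in the UniFormer block are equipped with local token affinity in shallow layers and global token affinity in deep layers, which addresses both redundancy and dependency and enables efficient and effective representation learning. Finally, we flexibly stack UniFormer blocks into a new, powerful backbone and apply it to a variety of vision tasks, from image to video domains and from classification to dense prediction. Without any extra training data, UniFormer achieves 86.3 top-1 accuracy on ImageNet-1K classification. With only ImageNet-1K pre-training, it obtains 82.9/84.8 top-1 accuracy on Kinetics-400/600, 60.9/71.2 top-1 accuracy on the Something-Something V1/V2 video classification tasks, 53.8 box AP and 46.4 mask AP on the COCO object detection task, 50.8 mIoU on the ADE20K semantic segmentation task, and 77.4 AP on the COCO pose estimation task. Code is available at https://github.com/Sense-X/UniFormer.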
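
The abstract describes relation aggregators that use local token affinity in shallow layers and global token affinity in deep layers. The following is a minimal pure-Python sketch of that idea on a 1D token sequence, not the authors' implementation. The function names (`local_aggregate`, `global_aggregate`, `stack_blocks`), the fixed window weights, the residual connection, and the omission of learned projections are all assumptions made for illustration; the released code at the URL above is the authoritative reference.

```python
import math


def local_aggregate(tokens, weights):
    """Mix each token with its neighbours inside a small window.

    Assumption: the affinity depends only on the relative offset,
    so this behaves like a 1D convolution with zero padding.
    """
    radius = len(weights) // 2
    dim = len(tokens[0])
    out = []
    for i in range(len(tokens)):
        acc = [0.0] * dim
        for offset in range(-radius, radius + 1):
            j = i + offset
            if 0 <= j < len(tokens):  # skip positions outside the sequence
                w = weights[offset + radius]
                for d in range(dim):
                    acc[d] += w * tokens[j][d]
        out.append(acc)
    return out


def global_aggregate(tokens):
    """Mix each token with all tokens via softmax dot-product similarity.

    Assumption: queries, keys, and values are the raw tokens
    (no learned projections), to keep the sketch short.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * tok[d] for e, tok in zip(exps, tokens))
                    for d in range(dim)])
    return out


def stack_blocks(tokens, shallow_layers, deep_layers, weights):
    """Apply local aggregation in shallow layers, then global in deep layers."""
    for _ in range(shallow_layers):
        mixed = local_aggregate(tokens, weights)
        tokens = [[x + m for x, m in zip(t, r)] for t, r in zip(tokens, mixed)]
    for _ in range(deep_layers):
        mixed = global_aggregate(tokens)
        tokens = [[x + m for x, m in zip(t, r)] for t, r in zip(tokens, mixed)]
    return tokens


if __name__ == "__main__":
    sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    result = stack_blocks(sequence, shallow_layers=2, deep_layers=1,
                          weights=[0.25, 0.5, 0.25])
    print(len(result), len(result[0]))
```

The local step costs a fixed amount per token regardless of sequence length, while the global step compares every token with every other one, which is the trade-off between redundancy and dependency that the abstract refers to.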

Related papers