Fugu-MT Paper Translation (Abstract): Transformers in Vision: A Survey

Paper overview: Transformers in Vision: A Survey

arxiv url: http://arxiv.org/abs/2101.01169v2
Date: Mon, 22 Feb 2021 11:40:11 GMT
Status: Translation complete
Last updated in system: 2021-04-12 01:34:59.953537
Title: Transformers in Vision: A Survey
Title (reference translation): Transformers in Vision: A Survey
Authors: Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, Mubarak Shah
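Abstract summary: Transformers model long dependencies between input sequence elements and support parallel processing of sequences. Transformers require minimal inductive biases in their design and are naturally suited as set-functions. This survey aims to provide an overview of Transformer models in the computer vision field.
Reference score (independently computed attention score): 101.07348618962111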
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works.
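Abstract (reference translation): Results from Transformer models on natural language tasks have drawn the interest of the vision community in studying their application to computer vision problems. Among their salient benefits, Transformers model long dependencies between input sequence elements and, compared with recurrent networks (e.g., long short-term memory (LSTM)), support parallel processing of sequences. Unlike convolutional networks, Transformers require minimal inductive biases in their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows multiple modalities (images, videos, text, speech, etc.) to be processed using similar processing blocks, and shows excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on many vision tasks using Transformer networks. This survey aims to provide an overview of Transformer models in the computer vision field. We begin by introducing the fundamental concepts behind the success of Transformers, namely self-attention, large-scale pre-training, and bidirectional encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the strengths and limitations of popular techniques in terms of both architectural design and experimental value. Finally, we analyze open research directions and future work.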

Related papers