Fugu-MT Paper Translation (Abstract): Interpret Vision Transformers as ConvNets with Dynamic Convolutions

Paper overview: Interpret Vision Transformers as ConvNets with Dynamic Convolutions

arxiv url: http://arxiv.org/abs/2309.10713v1
Date: Tue, 19 Sep 2023 16:00:49 GMT
Status: Translation complete
Last updated in system: 2023-09-20 13:43:34.373173
Title: Interpret Vision Transformers as ConvNets with Dynamic Convolutions
Title (reference translation): Interpreting Vision Transformers as ConvNets with Dynamic Convolutions
Authors: Chong Zhou, Chen Change Loy, Bo Dai
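Abstract summary: We interpret vision Transformers as ConvNets with dynamic convolutions, which lets us characterize existing Transformers and dynamic ConvNets in a unified framework. Because vision Transformers can then be considered from the design space of ConvNets, our interpretation can also guide network design.
Reference score (independently computed attention score): 70.59235381143831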
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: There has been a debate about the superiority between vision Transformers and ConvNets, serving as the backbone of computer vision models. Although they are usually considered as two completely different architectures, in this paper, we interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework and compare their design choices side by side. In addition, our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets and vice versa. We demonstrate such potential through two specific studies. First, we inspect the role of softmax in vision Transformers as the activation function and find it can be replaced by commonly used ConvNets modules, such as ReLU and Layer Normalization, which results in a faster convergence rate and better performance. Second, following the design of depth-wise convolution, we create a corresponding depth-wise vision Transformer that is more efficient with comparable performance. The potential of the proposed unified interpretation is not limited to the given examples and we hope it can inspire the community and give rise to more advanced network architectures.
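Abstract (reference translation): There is an ongoing debate over whether vision Transformers or ConvNets are the better backbone for computer vision models. Although the two are usually regarded as completely different architectures, this paper interprets vision Transformers as ConvNets with dynamic convolutions, characterizes existing Transformers and dynamic ConvNets in a unified framework, and compares their design choices side by side. The interpretation can also guide network design, since researchers can consider vision Transformers from the design space of ConvNets and vice versa. The paper demonstrates this potential through two specific studies. First, it examines the role of softmax in vision Transformers as an activation function and finds that it can be replaced by commonly used ConvNet modules such as ReLU and Layer Normalization, giving faster convergence and better performance. Second, following the design of depth-wise convolution, it creates a corresponding depth-wise vision Transformer that is more efficient with comparable performance. The potential of the proposed unified interpretation is not limited to these examples, and the authors hope it will inspire the community and lead to more advanced network architectures.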
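The following is a minimal sketch, not the authors' implementation, of the idea the abstract describes. Single-head attention is written as a weighted sum over all positions, where the weights (a "dynamic kernel") are computed from the input itself. The activation that produces the kernel is pluggable, so softmax can be swapped for ReLU. All function names, the toy tokens, and the use of identity projections for queries, keys, and values are assumptions made for illustration. The sketch omits Layer Normalization, and the exact formulation used in the paper is not given in the abstract.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu(scores):
    """Element-wise ReLU, used here as a drop-in alternative to softmax."""
    return [max(0.0, s) for s in scores]


def dynamic_kernel(queries, keys, activation):
    """Row i holds the input-dependent weights used at output position i."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [activation([dot(q, k) * scale for k in keys]) for q in queries]


def apply_kernel(kernel, values):
    """Weighted sum over all positions, like a convolution whose window
    covers the whole input and whose weights differ per output position."""
    dim = len(values[0])
    return [
        [sum(w * v[c] for w, v in zip(row, values)) for c in range(dim)]
        for row in kernel
    ]


def attention(queries, keys, values, activation=softmax):
    """Attention expressed as 'build a dynamic kernel, then apply it'."""
    return apply_kernel(dynamic_kernel(queries, keys, activation), values)


# Toy input: three 2-d tokens. Assumption: identity projections, so the
# tokens serve directly as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
softmax_kernel = dynamic_kernel(tokens, tokens, softmax)
softmax_out = attention(tokens, tokens, tokens)
relu_out = attention(tokens, tokens, tokens, activation=relu)
```

With softmax, each kernel row sums to 1. With ReLU, the weights are non-negative but unnormalized, which is one reason a normalization layer is a natural companion in this design space.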


Related papers