Fugu-MT Paper Translation (Abstract): The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy

Paper summary: The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy

arxiv url: http://arxiv.org/abs/2203.06345v1
Date: Sat, 12 Mar 2022 04:48:12 GMT
Status: Translation complete
System update date: 2022-03-15 14:20:39.831746
Title: The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy
Title (reference translation): The Principle of Diversity: Training Stronger Vision Transformers by Reducing All Levels of Redundancy
Authors: Tianlong Chen, Zhenyu Zhang, Yu Cheng, Ahmed Awadallah, Zhangyang Wang
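Abstract summary: This paper systematically studies the ubiquitous existence of redundancy at three levels: patch embedding, attention map, and weight space. It proposes corresponding regularizers that promote representation diversity and coverage at each level.
Reference score (independently computed attention measure): 111.49944789602884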
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision transformers (ViTs) have gained increasing popularity, as they are commonly believed to have higher modeling capacity and representation flexibility than traditional convolutional networks. However, it is questionable whether such potential has been fully unleashed in practice, as the learned ViTs often suffer from over-smoothing, yielding likely redundant models. Recent works made preliminary attempts to identify and alleviate such redundancy, e.g., via regularizing embedding similarity or re-injecting convolution-like structures. However, a "head-to-toe assessment" of the extent of redundancy in ViTs, and of how much could be gained by thoroughly mitigating it, has been absent from this field. This paper, for the first time, systematically studies the ubiquitous existence of redundancy at all three levels: patch embedding, attention map, and weight space. In view of this, we advocate a principle of diversity for training ViTs, presenting corresponding regularizers that encourage representation diversity and coverage at each of those levels, thereby enabling the capture of more discriminative information. Extensive experiments on ImageNet with a number of ViT backbones validate the effectiveness of our proposals, largely eliminating the observed ViT redundancy and significantly boosting model generalization. For example, our diversified DeiT obtains 0.70%~1.76% accuracy boosts on ImageNet with highly reduced similarity. Our code is fully available at https://github.com/VITA-Group/Diverse-ViT.
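Abstract (reference translation): Vision transformers (ViTs) are growing in popularity because they are believed to have higher modeling capacity and representation flexibility than traditional convolutional networks. However, it is doubtful whether that potential has actually been fully unleashed, since learned ViTs suffer from over-smoothing and likely yield redundant models. Recent work has made preliminary attempts to identify and mitigate such redundancy, for example by regularizing embedding similarity or re-injecting convolution-like structures. However, the field has lacked a "head-to-toe assessment" of the extent of redundancy in ViTs and of how much can be gained by thoroughly mitigating it. This paper systematically studies the ubiquitous existence of redundancy at three levels: patch embedding, attention map, and weight space. We therefore advocate a principle of diversity for training ViTs and present corresponding regularizers that encourage representation diversity and coverage at each level, making it possible to capture more discriminative information. Extensive experiments on ImageNet with a number of ViT backbones validate the effectiveness of the proposed approach, largely reducing the observed ViT redundancy and substantially improving model generalization. For example, our diversified DeiT achieves accuracy gains of 0.70% to 1.76% on ImageNet. Our code is available at https://github.com/VITA-Group/Diverse-ViT.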
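
The notion of patch-embedding redundancy mentioned in the abstract can be illustrated with a small similarity measure. The following is a minimal sketch, not the paper's regularizer: it assumes redundancy is summarized as the mean absolute cosine similarity over all pairs of patch embeddings, and the function names `cosine_similarity` and `mean_pairwise_similarity` are hypothetical, chosen only for this illustration.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector counts as dissimilar to everything
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(embeddings):
    """Mean absolute cosine similarity over all distinct pairs of embeddings.

    Values near 1 suggest near-duplicate (redundant) patch embeddings;
    values near 0 suggest diverse ones. Illustrative measure only.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0  # fewer than two embeddings: no pair to compare
    return sum(abs(cosine_similarity(u, v)) for u, v in pairs) / len(pairs)


# Toy inputs: three collinear embeddings versus two orthogonal ones.
redundant = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
diverse = [[1.0, 0.0], [0.0, 1.0]]
print(mean_pairwise_similarity(redundant))
print(mean_pairwise_similarity(diverse))
```

A quantity of this kind could in principle be added to a training loss as a penalty, so that lower similarity is rewarded. Whether the paper's regularizers take this exact form is not stated in the abstract.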

Related papers