Fugu-MT Paper Translation (Abstract): Revisiting Vision Transformer from the View of Path Ensemble

Paper summary: Revisiting Vision Transformer from the View of Path Ensemble

arxiv url: http://arxiv.org/abs/2308.06548v1
Date: Sat, 12 Aug 2023 12:18:16 GMT
Status: Translation complete
Last updated in system: 2023-08-15 16:44:27.800617
Title: Revisiting Vision Transformer from the View of Path Ensemble
Title (reference translation): Revisiting the Vision Transformer from the Perspective of Path Ensembles
Authors: Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou
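Abstract summary: Vision Transformers (ViTs) are usually regarded as a stack of transformer layers. This work shows that a ViT can instead be viewed as an ensemble network containing multiple parallel paths of different lengths.
Reference score (independently computed attention score): 40.093943843198424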
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision Transformers (ViTs) are normally regarded as a stack of transformer layers. In this work, we propose a novel view of ViTs showing that they can be seen as ensemble networks containing multiple parallel paths with different lengths. Specifically, we equivalently transform the traditional cascade of multi-head self-attention (MSA) and feed-forward network (FFN) into three parallel paths in each transformer layer. Then, we utilize the identity connection in our new transformer form and further transform the ViT into an explicit multi-path ensemble network. From the new perspective, these paths perform two functions: the first is to provide the feature for the classifier directly, and the second is to provide the lower-level feature representation for subsequent longer paths. We investigate the influence of each path for the final prediction and discover that some paths even pull down the performance. Therefore, we propose the path pruning and EnsembleScale skills for improvement, which cut out the underperforming paths and re-weight the ensemble components, respectively, to optimize the path combination and make the short paths focus on providing high-quality representation for subsequent paths. We also demonstrate that our path combination strategies can help ViTs go deeper and act as high-pass filters to filter out partial low-frequency signals. To further enhance the representation of paths served for subsequent paths, self-distillation is applied to transfer knowledge from the long paths to the short paths. This work calls for more future research to explain and design ViTs from new perspectives.
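Abstract (reference translation): Vision Transformers (ViTs) are usually regarded as a stack of transformer layers. This work proposes a new view of ViTs in which they can be seen as ensemble networks containing multiple parallel paths of different lengths. Specifically, the conventional cascade of multi-head self-attention (MSA) and feed-forward network (FFN) is equivalently transformed into three parallel paths in each transformer layer. The identity connection in this new transformer form is then used to further transform the ViT into an explicit multi-path ensemble network. From this perspective, the paths perform two functions: one is to provide features directly to the classifier, and the other is to provide lower-level feature representations to the subsequent longer paths. The influence of each path on the final prediction is investigated, and some paths are found to lower performance. To address this, path pruning and EnsembleScale are proposed as improvements: they remove underperforming paths and re-weight the ensemble components, respectively, optimizing the path combination and making the short paths focus on providing high-quality representations to subsequent paths. The path combination strategies are also shown to help ViTs go deeper and to act as high-pass filters that filter out part of the low-frequency signal. To further strengthen the representations that paths provide to subsequent paths, self-distillation is applied to transfer knowledge from the long paths to the short paths. This work calls for further research on explaining and designing ViTs from new perspectives.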
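
The "three parallel paths" rewrite mentioned in the abstract can be illustrated with a small numerical check. The sketch below is not the authors' code. It assumes the usual residual layer form h = x + MSA(x), y = h + FFN(h), from which y = x + MSA(x) + FFN(x + MSA(x)) follows by substitution, giving an identity path, an MSA path, and an FFN path. The functions `msa` and `ffn` are hypothetical toy stand-ins (no real attention, and layer normalization is omitted; it could be folded into either function), chosen only so the equivalence can be verified.

```python
import math

def msa(x):
    # Hypothetical stand-in for multi-head self-attention:
    # mixes each value with the mean over all values.
    mean = sum(x) / len(x)
    return [0.5 * (v + mean) for v in x]

def ffn(x):
    # Hypothetical stand-in for the feed-forward network:
    # an element-wise nonlinearity.
    return [math.tanh(2.0 * v) for v in x]

def add(*vectors):
    # Element-wise sum of equally sized vectors.
    return [sum(vals) for vals in zip(*vectors)]

def layer_cascade(x):
    # Conventional view: MSA then FFN, each wrapped in a residual connection.
    h = add(x, msa(x))
    return add(h, ffn(h))

def layer_three_paths(x):
    # Parallel view: the same output written as a sum of three paths.
    identity_path = x
    msa_path = msa(x)
    ffn_path = ffn(add(x, msa(x)))
    return add(identity_path, msa_path, ffn_path)

x = [0.3, -1.2, 0.7, 2.0]
cascade = layer_cascade(x)
parallel = layer_three_paths(x)
max_diff = max(abs(a - b) for a, b in zip(cascade, parallel))
print("max difference between the two forms:", max_diff)
```

Because the identity path passes `x` through unchanged, stacking such layers lets the output of a deep model be expanded into a sum over paths of different lengths, which is the ensemble view the abstract describes. How the paper weights or prunes those paths (EnsembleScale, path pruning) is not specified in the abstract and is not modeled here.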

Related papers