Fugu-MT Paper Translation (Abstract): AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+

Paper overview: AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+

arxiv url: http://arxiv.org/abs/2303.07598v1
Date: Tue, 14 Mar 2023 02:42:01 GMT
Status: Translation complete
System update date: 2023-03-15 16:34:16.771316
Title: AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+
Title (reference translation): AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+
Authors: Xiao Wang, Ying Wang, Ziwei Xuan, Guo-Jun Qi
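Abstract summary: This paper proposes an Adversarial Positional Embedding (AdPE) approach for pretraining vision transformers. AdPE distorts local visual structures by perturbing the position encodings. Experiments show that the approach improves the fine-tuning accuracy of MAE.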
Reference score (independently computed attention measure): 44.856035786948915
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unsupervised learning of vision transformers seeks to pretrain an encoder via pretext tasks without labels. Among them is the Masked Image Modeling (MIM) aligned with pretraining of language transformers by predicting masked patches as a pretext task. A criterion in unsupervised pretraining is the pretext task needs to be sufficiently hard to prevent the transformer encoder from learning trivial low-level features not generalizable well to downstream tasks. For this purpose, we propose an Adversarial Positional Embedding (AdPE) approach -- It distorts the local visual structures by perturbing the position encodings so that the learned transformer cannot simply use the locally correlated patches to predict the missing ones. We hypothesize that it forces the transformer encoder to learn more discriminative features in a global context with stronger generalizability to downstream tasks. We will consider both absolute and relative positional encodings, where adversarial positions can be imposed both in the embedding mode and the coordinate mode. We will also present a new MAE+ baseline that brings the performance of the MIM pretraining to a new level with the AdPE. The experiments demonstrate that our approach can improve the fine-tuning accuracy of MAE by $0.8\%$ and $0.4\%$ over 1600 epochs of pretraining ViT-B and ViT-L on Imagenet1K. For the transfer learning task, it outperforms the MAE with the ViT-B backbone by $2.6\%$ in mIoU on ADE20K, and by $3.2\%$ in AP$^{bbox}$ and $1.6\%$ in AP$^{mask}$ on COCO, respectively. These results are obtained with the AdPE being a pure MIM approach that does not use any extra models or external datasets for pretraining. The code is available at https://github.com/maple-research-lab/AdPE.
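Abstract (reference translation): Unsupervised learning of vision transformers seeks to pretrain an encoder via pretext tasks without labels. Among these, Masked Image Modeling (MIM) parallels the pretraining of language transformers by predicting masked patches as the pretext task. One criterion for unsupervised pretraining is that the pretext task must be hard enough to keep the transformer encoder from learning trivial low-level features that do not generalize well to downstream tasks. To this end, we propose the Adversarial Positional Embedding (AdPE) approach, which distorts local visual structures by perturbing the position encodings so that the learned transformer cannot simply use locally correlated patches to predict the missing ones. We hypothesize that this forces the transformer encoder to learn more discriminative features in a global context, with stronger generalizability to downstream tasks. We consider both absolute and relative positional encodings, where adversarial positions can be imposed in either the embedding mode or the coordinate mode. We also present a new MAE+ baseline that, together with AdPE, brings the performance of MIM pretraining to a new level. Experiments show that the approach improves the fine-tuning accuracy of MAE by $0.8\%$ and $0.4\%$ over 1600 epochs of pretraining ViT-B and ViT-L on Imagenet1K. In transfer learning, it outperforms MAE with the ViT-B backbone by $2.6\%$ in mIoU on ADE20K, and by $3.2\%$ in AP$^{bbox}$ and $1.6\%$ in AP$^{mask}$ on COCO. These results are obtained with AdPE as a pure MIM approach that uses no extra models or external datasets for pretraining. The code is available at https://github.com/maple-research-lab/AdPE.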
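
Illustrative sketch (not from the paper): The abstract describes perturbing position encodings adversarially so that masked-patch prediction becomes harder. The toy NumPy example below shows one way such a perturbation could look in the "embedding mode". Everything in it is an assumption made for illustration: the linear stand-in for the encoder and decoder, the single sign-gradient ascent step, the L-infinity bound `eps`, and all function names. It is not the authors' implementation, which is available at the repository linked above.

```python
import numpy as np

def reconstruction_loss(patches, pos, delta, W, target):
    """Toy MIM loss: predict one masked patch from the mean of visible tokens."""
    tokens = patches + pos + delta           # positional embeddings plus perturbation
    pred = tokens.mean(axis=0) @ W           # linear stand-in for encoder + decoder (assumption)
    return 0.5 * np.sum((pred - target) ** 2)

def loss_grad_wrt_delta(patches, pos, delta, W, target):
    """Analytic gradient of the toy loss with respect to the perturbation delta."""
    n = patches.shape[0]
    tokens = patches + pos + delta
    residual = tokens.mean(axis=0) @ W - target   # shape (d_out,)
    grad_row = (W @ residual) / n                 # same gradient for every token row
    return np.tile(grad_row, (n, 1))

def adversarial_step(patches, pos, W, target, eps=0.1):
    """One sign-gradient ASCENT step on delta, kept within [-eps, eps] (hypothetical scheme)."""
    delta = np.zeros_like(pos)
    grad = loss_grad_wrt_delta(patches, pos, delta, W, target)
    return np.clip(delta + eps * np.sign(grad), -eps, eps)

rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))   # 4 visible patches, embedding dimension 8
pos = rng.normal(size=(4, 8))       # absolute positional embeddings
W = rng.normal(size=(8, 8))         # toy predictor weights
target = rng.normal(size=8)         # embedding of the masked patch to reconstruct

delta = adversarial_step(patches, pos, W, target)
clean = reconstruction_loss(patches, pos, np.zeros_like(pos), W, target)
perturbed = reconstruction_loss(patches, pos, delta, W, target)
print(f"clean loss: {clean:.4f}, perturbed loss: {perturbed:.4f}")
```

In this toy setting the loss is convex in the perturbation, so a bounded step along the sign of the gradient cannot lower it. In an actual pretraining loop the model would then be updated to minimize the loss under that perturbation, giving a min-max objective; how the paper structures that loop is not stated in the abstract.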

Related papers