Fugu-MT Paper Translation (Abstract): Improvements to Self-Supervised Representation Learning for Masked Image Modeling

Paper overview: Improvements to Self-Supervised Representation Learning for Masked Image Modeling

arxiv url: http://arxiv.org/abs/2205.10546v1
Date: Sat, 21 May 2022 09:45:50 GMT
Status: Translation complete
System update date: 2022-06-05 17:11:13.863206
Title: Improvements to Self-Supervised Representation Learning for Masked Image Modeling
Authors: Jiawei Mao, Xuesong Yin, Yuanqi Chang, Honggu Zhou
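Abstract summary: This paper explores improvements to the masked image modeling (MIM) paradigm. In MIM, the input image is masked and the masked part is predicted from the unmasked part, which lets the model learn the features of the image's main object. We propose a new model, Contrastive Masked AutoEncoders (CMAE).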
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper explores improvements to the masked image modeling (MIM) paradigm. MIM enables a model to learn the features of the main object in an image by masking the input image and predicting the masked part from the unmasked part. We identify three main directions in which MIM can be improved. First, although both the encoder and the decoder contribute to representation learning, MIM uses only the encoder for downstream tasks, which ignores the decoder's impact on representation learning. The MIM paradigm already employs small decoders with asymmetric structures, but we believe that further reducing the decoder's parameters improves the representation learning capability of the encoder. Second, MIM solves the image prediction task by training the encoder and decoder together and does not design a separate task for the encoder. To further improve the encoder's performance on downstream tasks, we give the encoder two additional tasks: contrastive learning and token position prediction. Third, the input image may contain background and other objects, and the proportion of the image occupied by each object varies, so reconstructing tokens related to the background or to other objects does not help MIM learn representations of the main object. We therefore use ContrastiveCrop to crop the input image so that it contains, as far as possible, only the main object. Based on these three improvements to MIM, we propose a new model, Contrastive Masked AutoEncoders (CMAE). With a ViT-B backbone, CMAE achieves a Top-1 accuracy of 65.84% on Tiny ImageNet, which is 2.89 points higher than the competing method MAE when all conditions are equal. Code will be made available.
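
The masking-and-prediction step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' CMAE code: the function names `patchify`, `random_mask`, and `masked_mse`, the 8x8 patch size, and the 0.75 mask ratio are all assumptions chosen for illustration, and a trivial mean predictor stands in for the encoder and decoder. The 64x64 RGB input matches the image size of Tiny ImageNet.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def random_mask(num_patches, mask_ratio, rng):
    """Randomly split patch indices into (visible_idx, masked_idx)."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    masked_idx = np.sort(perm[:num_masked])
    visible_idx = np.sort(perm[num_masked:])
    return visible_idx, masked_idx

def masked_mse(pred, target, masked_idx):
    """Mean squared reconstruction error computed on masked patches only."""
    diff = pred[masked_idx] - target[masked_idx]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))      # random stand-in for a 64x64 RGB image
patches = patchify(image, patch=8)   # assumed patch size: 8x8 -> 64 patches
visible_idx, masked_idx = random_mask(len(patches), mask_ratio=0.75, rng=rng)

# Stand-in "model" (assumption): predict every patch as the mean visible patch.
# A real MIM model would encode the visible patches and decode the masked ones.
pred = np.tile(patches[visible_idx].mean(axis=0), (len(patches), 1))
loss = masked_mse(pred, patches, masked_idx)
print(f"{len(visible_idx)} visible, {len(masked_idx)} masked, loss={loss:.4f}")
```

Only the visible patches reach the predictor, and the loss is computed on the masked patches alone, which is the sense in which the masked part is predicted from the unmasked part.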

Related papers