Fugu-MT Paper Translation (Abstract): BEVT: BERT Pretraining of Video Transformers

Paper overview: BEVT: BERT Pretraining of Video Transformers

arxiv url: http://arxiv.org/abs/2112.01529v1
Date: Thu, 2 Dec 2021 18:59:59 GMT
Status: Translation complete
Last updated in system: 2021-12-03 15:06:30.779298
Title: BEVT: BERT Pretraining of Video Transformers
Title (reference translation): BEVT: BERT Pretraining of Video Transformers
Authors: Rui Wang and Dongdong Chen and Zuxuan Wu and Yinpeng Chen and Xiyang Dai and Mengchen Liu and Yu-Gang Jiang and Luowei Zhou and Lu Yuan
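Abstract summary: This paper introduces BEVT, which decouples video representation learning into spatial representation learning and temporal dynamics learning. The authors conduct extensive experiments on three challenging video benchmarks, where BEVT achieves very promising results.
Reference score (independently computed attention score): 89.08460834954161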
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper studies the BERT pretraining of video transformers. It is a straightforward but worth-studying extension given the recent success of BERT pretraining of image transformers. We introduce BEVT, which decouples video representation learning into spatial representation learning and temporal dynamics learning. In particular, BEVT first performs masked image modeling on image data, and then conducts masked image modeling jointly with masked video modeling on video data. This design is motivated by two observations: 1) transformers learned on image datasets provide decent spatial priors that can ease the learning of video transformers, which are often computationally intensive if trained from scratch; 2) the discriminative clues needed to make correct predictions, i.e., spatial and temporal information, vary among different videos due to large intra-class and inter-class variations. We conduct extensive experiments on three challenging video benchmarks, where BEVT achieves very promising results. On Kinetics 400, for which recognition mostly relies on discriminative spatial representations, BEVT achieves results comparable to strong supervised baselines. On Something-Something-V2 and Diving 48, which contain videos relying on temporal dynamics, BEVT outperforms all alternative baselines by clear margins and achieves state-of-the-art performance with 70.6% and 86.7% Top-1 accuracy, respectively.
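Abstract (reference translation): This paper examines BERT pretraining of video transformers. Given the recent success of BERT pretraining for image transformers, this is a simple but worthwhile extension. The paper introduces BEVT, which decouples video representation learning into spatial representation learning and temporal dynamics learning. Specifically, BEVT first performs masked image modeling on image data, and then performs masked image modeling jointly with masked video modeling on video data. The design is motivated by two observations: 1) transformers trained on image datasets provide adequate spatial priors that ease the training of video transformers; 2) the discriminative clues needed to make correct predictions, namely spatial and temporal information, vary across videos because of large intra-class and inter-class variations. The authors conduct extensive experiments on three challenging video benchmarks, where BEVT achieves very promising results. On Kinetics 400, where recognition relies mostly on discriminative spatial representations, BEVT achieves results comparable to strong supervised baselines. On Something-Something-V2 and Diving 48, which contain videos that depend on temporal dynamics, BEVT outperforms all alternative baselines by clear margins and achieves state-of-the-art performance with 70.6% and 86.7% accuracy, respectively.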
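
Illustrative note: the joint objective described in the abstract (masked image modeling combined with masked video modeling) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each masked position is trained to predict a discrete token id with a cross-entropy loss, and the names `masked_token_loss`, `random_mask`, `joint_loss`, and the two loss weights are hypothetical and do not come from the abstract.

```python
import math
import random


def masked_token_loss(logits, targets, mask):
    """Mean cross-entropy over masked positions only.

    logits:  list of per-position score lists (one score per token id)
    targets: list of target token ids (assumed discrete, hypothetical)
    mask:    list of booleans, True where the position was masked
    """
    losses = []
    for row, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # unmasked positions do not contribute to the loss
        top = max(row)
        # Numerically stable log-sum-exp for the softmax normaliser.
        log_z = top + math.log(sum(math.exp(v - top) for v in row))
        losses.append(log_z - row[target])
    return sum(losses) / len(losses) if losses else 0.0


def random_mask(num_tokens, ratio, rng):
    """Mark roughly `ratio` of the positions (at least one) as masked."""
    k = max(1, int(num_tokens * ratio))
    chosen = set(rng.sample(range(num_tokens), k))
    return [i in chosen for i in range(num_tokens)]


def joint_loss(image_loss, video_loss, image_weight=1.0, video_weight=1.0):
    """Weighted sum of the image and video losses (weights are assumptions)."""
    return image_weight * image_loss + video_weight * video_loss


if __name__ == "__main__":
    rng = random.Random(0)
    vocab, n_img, n_vid = 8, 6, 12
    # Stand-in scores and targets; a real model would produce the logits.
    img_logits = [[rng.gauss(0, 1) for _ in range(vocab)] for _ in range(n_img)]
    vid_logits = [[rng.gauss(0, 1) for _ in range(vocab)] for _ in range(n_vid)]
    img_targets = [rng.randrange(vocab) for _ in range(n_img)]
    vid_targets = [rng.randrange(vocab) for _ in range(n_vid)]
    img_loss = masked_token_loss(img_logits, img_targets, random_mask(n_img, 0.4, rng))
    vid_loss = masked_token_loss(vid_logits, vid_targets, random_mask(n_vid, 0.4, rng))
    print(joint_loss(img_loss, vid_loss))
```

In a two-stage schedule like the one the abstract outlines, the first stage would use only the image term, and the second stage would optimise the combined value on video data. How the two terms are actually balanced is not stated in the abstract.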


List of related papers