Fugu-MT Paper Translation (Abstract): TrackMAE: Video Representation Learning via Track Mask and Predict

Paper overview: TrackMAE: Video Representation Learning via Track Mask and Predict

arxiv url: http://arxiv.org/abs/2603.27268v1
Date: Sat, 28 Mar 2026 13:35:23 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:44.874264
Title: TrackMAE: Video Representation Learning via Track Mask and Predict
Title (reference translation): TrackMAE: Video Representation Learning via Track Masking and Prediction
Authors: Renaud Vandeghen, Fida Mohammad Thoker, Marc Van Droogenbroeck, Bernard Ghanem
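Abstract summary: Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm. We propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. We evaluate on six datasets across diverse downstream settings and find that TrackMAE consistently outperforms state-of-the-art video self-supervised learning baselines.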
Reference score (independently computed attention score): 53.79942817343784
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but only encodes motion information implicitly, limiting the encoding of temporal dynamics in the learned representations. As a result, such models struggle on motion-centric tasks that require fine-grained motion awareness. To address this, we propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. In TrackMAE, we use an off-the-shelf point tracker to sparsely track points in the input videos, generating motion trajectories. Furthermore, we exploit the extracted trajectories to improve random tube masking with a motion-aware masking strategy. We enhance video representations learned in both pixel and feature semantic reconstruction spaces by providing a complementary supervision signal in the form of motion targets. We evaluate on six datasets across diverse downstream settings and find that TrackMAE consistently outperforms state-of-the-art video self-supervised learning baselines, learning more discriminative and generalizable representations. Code available at https://github.com/rvandeghen/TrackMAE
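Abstract (reference translation): Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but it encodes motion information only implicitly, which limits how well temporal dynamics are captured in the learned representations. As a result, such models struggle on motion-centric tasks that require fine-grained motion awareness. We therefore propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. TrackMAE uses an off-the-shelf point tracker to sparsely track points in the input videos and generate motion trajectories. The extracted trajectories are also used to improve random tube masking through a motion-aware masking strategy. Providing a complementary supervision signal in the form of motion targets enhances the video representations learned in both the pixel and the feature (semantic) reconstruction spaces. We evaluate on six datasets across diverse downstream settings and find that TrackMAE consistently outperforms state-of-the-art video self-supervised learning baselines while learning more discriminative and generalizable representations. Code: https://github.com/rvandeghen/TrackMAE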
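
The abstract does not say how trajectories bias the tube mask, so the following is only an illustrative sketch of one way "motion-aware tube masking" could work, not the authors' implementation. It assumes point tracks are already available as an array of (x, y) positions per frame. The function name `motion_aware_tube_mask`, the `alpha` mixing weight, and the choice to bin each track by its first-frame position are all hypothetical. The mask is sampled once per clip and repeated over frames, as in tube masking, with patches that contain more track motion being more likely to be masked.

```python
import numpy as np


def motion_aware_tube_mask(tracks, height, width, patch=16,
                           mask_ratio=0.9, alpha=0.5, seed=0):
    """Sample a tube mask biased toward patches with more tracked motion.

    tracks: (N, T, 2) array of (x, y) pixel positions for N points over T frames.
    alpha:  hypothetical mixing weight in [0, 1); 0 gives plain random tube masking.
    Returns a boolean array of shape (T, height // patch, width // patch),
    where True marks a masked patch.
    """
    tracks = np.asarray(tracks, dtype=np.float64)
    grid_h, grid_w = height // patch, width // patch
    num_patches = grid_h * grid_w

    # Total path length of each track across the clip.
    steps = np.diff(tracks, axis=1)                        # (N, T-1, 2)
    path_len = np.linalg.norm(steps, axis=-1).sum(axis=1)  # (N,)

    # Assign each track to the patch containing its first-frame position
    # (an assumption made for simplicity).
    col = np.clip((tracks[:, 0, 0] // patch).astype(int), 0, grid_w - 1)
    row = np.clip((tracks[:, 0, 1] // patch).astype(int), 0, grid_h - 1)
    motion = np.zeros(num_patches)
    np.add.at(motion, row * grid_w + col, path_len)

    # Mix a uniform distribution with a motion-proportional one, so every
    # patch keeps a nonzero chance of being masked.
    uniform = np.full(num_patches, 1.0 / num_patches)
    total = motion.sum()
    motion_p = motion / total if total > 0 else uniform
    prob = (1.0 - alpha) * uniform + alpha * motion_p

    # Sample the spatial mask once, then repeat it over time (a "tube").
    rng = np.random.default_rng(seed)
    n_mask = int(round(mask_ratio * num_patches))
    masked = rng.choice(num_patches, size=n_mask, replace=False, p=prob)
    mask2d = np.zeros(num_patches, dtype=bool)
    mask2d[masked] = True
    mask2d = mask2d.reshape(grid_h, grid_w)
    return np.broadcast_to(mask2d, (tracks.shape[1], grid_h, grid_w)).copy()
```

With `alpha=0` this reduces to uniform random tube masking. Larger values concentrate the mask on moving regions. How the paper actually balances the two is not stated in the abstract.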


Related papers