Fugu-MT Paper Translation (Abstract): Momentum-Guided Semantic Forecasting (MoFore) for Self-Supervised Video Representation Learning

Paper summary: Momentum-Guided Semantic Forecasting (MoFore) for Self-Supervised Video Representation Learning

arxiv url: http://arxiv.org/abs/2606.14765v1
Date: Mon, 08 Jun 2026 17:50:01 GMT
Status: Translation complete
System updated: 2026-06-16 18:36:04.912786
Title: Momentum-Guided Semantic Forecasting (MoFore) for Self-Supervised Video Representation Learning
Authors: Qinwu Xu
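Abstract summary: We propose a Momentum-Guided Semantic Forecasting framework for self-supervised video representation learning. The framework learns temporally consistent, semantically meaningful video representations without using action labels during training.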
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Self-supervised video representation learning has recently advanced through contrastive learning, masked reconstruction, and predictive representation learning. Reconstruction-based approaches such as MAE and VideoMAE learn representations by recovering masked visual content \cite{he2022mae,tong2022videomae}, while contrastive methods such as CLIP learn semantically meaningful embedding spaces through representation alignment \cite{radford2021clip}. In this work, we introduce a Momentum-Guided Semantic Forecasting framework (MoFore) for self-supervised video representation learning. Instead of optimizing for pixel-level reconstruction or task-specific semantic alignment, the proposed method learns temporally predictive video representations by forecasting future latent embeddings from temporally distant context clips. To improve robustness across temporal scales, we further introduce randomized temporal-gap forecasting during training. The framework combines predictive latent forecasting with contrastive regularization to encourage temporal consistency while preventing representation collapse. Experiments on the UCF101 dataset demonstrate that the proposed framework learns temporally consistent and semantically meaningful video representations without using action labels during training. Quantitative analysis shows strong temporal stability and emergent category-level structure in the learned embedding space, while qualitative retrieval experiments reveal motion-aware organization across related activities. Overall, the results suggest that long-range latent forecasting provides an effective and computationally efficient approach for self-supervised video representation learning without relying on reconstruction-based objectives.
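The abstract names three ingredients: forecasting a future latent embedding from a distant context clip, a randomized temporal gap, and a contrastive regularizer against collapse. The sketch below illustrates how such pieces could fit together. It is not the author's implementation. The abstract does not say how momentum enters the method, so the exponential-moving-average (EMA) target update is an assumption. The function names, the uniform gap distribution, the cosine forecasting loss, the InfoNCE form of the contrastive term, and all default values are likewise hypothetical.

```python
import math
import random


def sample_clip_pair(num_frames, clip_len, min_gap, max_gap, rng):
    """Pick start frames for a context clip and a later target clip.

    The gap is drawn uniformly from [min_gap, max_gap] (assumed distribution).
    """
    gap = rng.randint(min_gap, max_gap)
    last_start = num_frames - (2 * clip_len + gap)
    if last_start < 0:
        raise ValueError("video too short for the requested clips and gap")
    ctx_start = rng.randint(0, last_start)
    tgt_start = ctx_start + clip_len + gap
    return ctx_start, tgt_start, gap


def ema_update(target_params, online_params, momentum=0.99):
    """Assumed momentum rule: the target slowly tracks the online parameters."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def forecasting_loss(preds, targets):
    """Mean (1 - cosine similarity) between predicted and target embeddings."""
    losses = [1.0 - dot(l2_normalize(p), l2_normalize(t))
              for p, t in zip(preds, targets)]
    return sum(losses) / len(losses)


def info_nce(preds, targets, temperature=0.1):
    """Contrastive term: each prediction should match its own target
    rather than the other targets in the batch."""
    preds = [l2_normalize(p) for p in preds]
    targets = [l2_normalize(t) for t in targets]
    total = 0.0
    for i, p in enumerate(preds):
        logits = [dot(p, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(preds)


def total_loss(preds, targets, contrastive_weight=0.5):
    """Hypothetical weighted sum of the forecasting and contrastive terms."""
    return forecasting_loss(preds, targets) + contrastive_weight * info_nce(preds, targets)
```

In a training loop of this shape, `preds` would come from a predictor applied to the online encoder's context embedding and `targets` from the EMA encoder applied to the future clip, with `ema_update` called after each optimizer step.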
