Fugu-MT paper translation (abstract): What Makes Video World Model Latents Action-Relevant: Prediction over Reconstruction

Paper summary: What Makes Video World Model Latents Action-Relevant: Prediction over Reconstruction

arxiv url: http://arxiv.org/abs/2606.07687v1
Date: Fri, 05 Jun 2026 04:43:02 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:05.277063
Title: What Makes Video World Model Latents Action-Relevant: Prediction over Reconstruction
Title (reference translation): What makes the latents of video world models action-relevant: prediction over reconstruction
Authors: Jewon Yeom, Hanseul Kim, Jeongjae Park, Sungmok Jung, Jaejin Lee, Taesup Kim
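Abstract summary: Action-relevant structure is driven primarily by temporal video pretraining rather than by pixel reconstruction fidelity. The study identifies temporal predictive structure, not reconstruction fidelity, as the primary ingredient underlying action-relevant video representations.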
Reference score (independently computed attention score): 9.020077150911526
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video world models are increasingly used to provide predictive visual representations, yet it remains unclear which pretraining signals induce action-relevant structure in their latent spaces. We study this question through a unified probe-based evaluation across diverse encoder families, including image-only self-supervision, video pretraining with and without latent prediction, reconstruction-based autoencoders, diffusion models, and shortcut-forcing dynamics models. Using a common inverse-dynamics probing objective, we find that action-relevant structure is driven primarily by temporal video pretraining rather than pixel reconstruction fidelity: models with strong pixel decoding quality can exhibit near-zero action recoverability, while video-pretrained self-supervised encoders consistently achieve the best Pareto trade-off between visual fidelity and action prediction. Comparing V-JEPA and VideoMAE further shows that most gains arise from natural-video temporal context, with feature-level latent prediction providing a smaller additional benefit. These trends transfer across robotic benchmarks, though CALVIN reveals that static-environment tasks can partially mask the importance of temporal structure by allowing strong image priors to suffice. Finally, inverse-dynamics supervision substantially improves robustness to visual corruption, suggesting that action-aware objectives regularize latent geometry beyond clean-setting performance. Our results identify temporal predictive structure -- not reconstruction fidelity -- as the primary ingredient underlying action-relevant video representations.
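Abstract (reference translation): Video world models are increasingly used to provide predictive visual representations, but it remains unclear which pretraining signals induce action-relevant structure in their latent spaces. This study conducts a unified probe-based evaluation across diverse encoder families, including image-only self-supervision, video pretraining with and without latent prediction, reconstruction-based autoencoders, diffusion models, and shortcut-forcing dynamics models. Under a common inverse-dynamics probing objective, action-relevant structure is driven primarily by temporal video pretraining rather than by pixel reconstruction fidelity: models with strong pixel decoding quality can show near-zero action recoverability, whereas video-pretrained self-supervised encoders consistently achieve the best Pareto trade-off between visual fidelity and action prediction. A comparison of V-JEPA and VideoMAE shows that most of the gain comes from the temporal context of natural video, with feature-level latent prediction adding a smaller benefit. These trends transfer across robotic benchmarks, although CALVIN shows that static-environment tasks can partially mask the importance of temporal structure because strong image priors are sufficient there. Finally, inverse-dynamics supervision substantially improves robustness to visual corruption, suggesting that action-aware objectives regularize latent geometry beyond clean-setting performance. The results identify temporal predictive structure, not reconstruction fidelity, as the primary ingredient underlying action-relevant video representations.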
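The abstract does not specify how the inverse-dynamics probe is built, so the following is only a minimal sketch of the general idea: freeze an encoder, take latents for consecutive frames, and fit a small probe that predicts the action taken between them. Everything here is an assumption made for illustration, not the authors' setup: the linear ridge probe, the R^2 metric, the toy latent generator `make_toy_latents`, and the `action_gain` parameter that controls how much action information the synthetic latents carry.

```python
import numpy as np


def _features(z_t, z_next):
    # Probe input: both latents concatenated, plus a constant bias column.
    return np.concatenate([z_t, z_next, np.ones((len(z_t), 1))], axis=1)


def fit_inverse_dynamics_probe(z_t, z_next, actions, ridge=1e-3):
    """Fit a linear ridge probe mapping (z_t, z_next) -> action (closed form)."""
    x = _features(z_t, z_next)
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ actions)


def probe_r2(weights, z_t, z_next, actions):
    """Coefficient of determination of the probe on the given data."""
    pred = _features(z_t, z_next) @ weights
    ss_res = np.sum((actions - pred) ** 2)
    ss_tot = np.sum((actions - actions.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot


def make_toy_latents(rng, n, latent_dim, action_dim, action_gain):
    """Hypothetical stand-in for frozen encoder outputs.

    action_gain = 0 gives latents whose frame-to-frame change ignores the action.
    """
    actions = rng.normal(size=(n, action_dim))
    z_t = rng.normal(size=(n, latent_dim))
    mixing = rng.normal(size=(action_dim, latent_dim))
    noise = 0.05 * rng.normal(size=(n, latent_dim))
    z_next = z_t + action_gain * (actions @ mixing) + noise
    return z_t, z_next, actions


def evaluate(action_gain, seed=0, n=2000, split=1500):
    """Train the probe on the first `split` samples, score it on the rest."""
    rng = np.random.default_rng(seed)
    z_t, z_next, a = make_toy_latents(rng, n, 16, 4, action_gain)
    w = fit_inverse_dynamics_probe(z_t[:split], z_next[:split], a[:split])
    return probe_r2(w, z_t[split:], z_next[split:], a[split:])


r2_action_aware = evaluate(action_gain=1.0)
r2_action_blind = evaluate(action_gain=0.0)
print(f"held-out R^2, action-aware latents: {r2_action_aware:.3f}")
print(f"held-out R^2, action-blind latents: {r2_action_blind:.3f}")
```

In this toy setting the probe recovers actions well only when the change between consecutive latents depends on the action, which is the sense in which a representation can decode pixels well and still show near-zero action recoverability.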


Related papers