Fugu-MT Paper Translation (Abstract): Video Generation with Predictive Latents

Paper overview: Video Generation with Predictive Latents

arxiv url: http://arxiv.org/abs/2605.02134v1
Date: Mon, 04 May 2026 01:30:04 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:50.099917
Title: Video Generation with Predictive Latents
Authors: Yian Zhao, Feng Wang, Qiushan Guo, Chang Liu, Xiangyang Ji, Jian Zhang, Jie Chen,
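Abstract summary: A video variational autoencoder (VAE) enables latent video generative modeling by mapping the visual world into a compact latent space. How to enhance the diffusability of video latents remains a critical and unresolved challenge. This work proposes a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction.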
Reference score (independently computed attention score): 50.3100375593545
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The video Variational Autoencoder (VAE) enables latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimization of reconstruction does not necessarily translate into improved generative performance. How to enhance the diffusability of video latents remains a critical and unresolved challenge. In this work, inspired by principles of predictive world modeling, we investigate the potential of predictive learning to improve video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only partial past observations, while training the decoder to reconstruct the observed frames and predict future ones simultaneously. This design encourages the latent space to encode temporally predictive structures and build a more coherent understanding of video dynamics, thereby improving generation quality. Our model, termed Predictive Video VAE (PV-VAE), achieves superior performance on video generation, with 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101. Furthermore, comprehensive analyses demonstrate that PV-VAE not only exhibits favorable scalability, with generative performance improving alongside VAE training, but also yields consistent gains in downstream video understanding, underscoring a latent space that effectively captures temporal coherence and motion priors.
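The predictive reconstruction objective described in the abstract (discard future frames, encode only the past, decode both observed and future frames) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the plain mean-squared-error loss, the uniform choice of how many frames to keep, and the toy encoder and decoder are all assumptions made for illustration, and frames are represented as plain lists of floats.

```python
import random


def sample_num_observed(num_frames, min_observed=1, rng=random):
    """Randomly choose how many leading frames the encoder may see (assumed uniform)."""
    if not 1 <= min_observed <= num_frames:
        raise ValueError("min_observed must be in [1, num_frames]")
    return rng.randint(min_observed, num_frames)


def mean_squared_error(predicted, target):
    """MSE over two equally shaped lists of frames (each frame a list of floats)."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(predicted, target):
        for p, t in zip(pred_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


def predictive_reconstruction_loss(video, encode, decode, num_observed):
    """Return (reconstruction_loss, prediction_loss) for one clip."""
    observed = video[:num_observed]        # future frames are discarded here
    latent = encode(observed)              # the encoder sees only the past
    output = decode(latent, len(video))    # the decoder must emit every frame
    recon = mean_squared_error(output[:num_observed], observed)
    pred = mean_squared_error(output[num_observed:], video[num_observed:])
    return recon, pred


# Hypothetical toy stand-ins: the "latent" is the last observed frame,
# and the decoder repeats it for every output frame.
def toy_encode(frames):
    return list(frames[-1])


def toy_decode(latent, num_frames):
    return [list(latent) for _ in range(num_frames)]


if __name__ == "__main__":
    clip = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    k = sample_num_observed(len(clip), rng=random.Random(0))
    print(k, predictive_reconstruction_loss(clip, toy_encode, toy_decode, k))
```

In a real setup the two terms would presumably be combined into one training loss and backpropagated through a learned encoder and decoder; the split into two return values here only makes the "reconstruct observed, predict future" structure visible.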

Related papers