Fugu-MT Paper Translation (Abstract): CV-VAE: A Compatible Video VAE for Latent Generative Video Models

Paper overview: CV-VAE: A Compatible Video VAE for Latent Generative Video Models

arxiv url: http://arxiv.org/abs/2405.20279v2
Date: Wed, 23 Oct 2024 02:38:44 GMT
Status: Translation complete
Last updated in system: 2024-11-28 17:07:33.062756
Title: CV-VAE: A Compatible Video VAE for Latent Generative Video Models
Title (reference translation): CV-VAE: A Compatible Video VAE for Latent Generative Video Models
Authors: Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, Ying Shan
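Abstract summary: Variational Autoencoders (VAEs) play a crucial role in the spatio-temporal compression of videos in OpenAI's SORA and other video generative models. The research community currently lacks a commonly used continuous video (3D) VAE for latent diffusion-based video models. This paper proposes a method for training a video VAE for latent video models, CV-VAE, whose latent space is compatible with that of a given image VAE.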
Reference score (independently computed attention measure): 45.702473834294146
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Spatio-temporal compression of videos, utilizing networks such as Variational Autoencoders (VAEs), plays a crucial role in OpenAI's SORA and numerous other video generative models. For instance, many LLM-like video models learn the distribution of discrete tokens derived from 3D VAEs within the VQVAE framework, while most diffusion-based video models capture the distribution of continuous latents extracted by 2D VAEs without quantization. The temporal compression is simply realized by uniform frame sampling, which results in unsmooth motion between consecutive frames. Currently, the research community lacks a commonly used continuous video (3D) VAE for latent diffusion-based video models. Moreover, since current diffusion-based approaches are often implemented using pre-trained text-to-image (T2I) models, directly training a video VAE without considering compatibility with existing T2I models will result in a latent space gap between them, and bridging that gap takes huge computational resources for training, even with the T2I models as initialization. To address this issue, we propose a method for training a video VAE for latent video models, namely CV-VAE, whose latent space is compatible with that of a given image VAE, e.g., the image VAE of Stable Diffusion (SD). The compatibility is achieved by the proposed novel latent space regularization, which involves formulating a regularization loss using the image VAE. Benefiting from the latent space compatibility, video models can be trained seamlessly from pre-trained T2I or video models in a truly spatio-temporally compressed latent space, rather than simply sampling video frames at equal intervals. With our CV-VAE, existing video models can generate four times more frames with minimal finetuning. Extensive experiments are conducted to demonstrate the effectiveness of the proposed video VAE.
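Abstract (reference translation): Spatio-temporal compression of videos using networks such as Variational Autoencoders (VAEs) plays an important role in OpenAI's SORA and many other video generative models. For example, many LLM-like video models learn the distribution of discrete tokens derived from 3D VAEs within the VQVAE framework, while most diffusion-based video models capture the distribution of continuous latents extracted by 2D VAEs without quantization. Temporal compression is realized simply by uniform frame sampling, which leads to unsmooth motion between consecutive frames. At present, the research community lacks a commonly used continuous video (3D) VAE for latent diffusion-based video models. Furthermore, because current diffusion-based approaches are often implemented using pre-trained text-to-image (T2I) models, training a video VAE directly without considering compatibility with existing T2I models produces a latent space gap between them, and bridging that gap requires enormous computational resources even when the T2I models are used as initialization. This paper therefore proposes a method for training a video VAE for latent video models, CV-VAE, whose latent space is compatible with that of a given image VAE, e.g., the image VAE of Stable Diffusion (SD). This compatibility is achieved by a novel latent space regularization that formulates a regularization loss using the image VAE. Thanks to the latent space compatibility, video models can be trained seamlessly from pre-trained T2I or video models in a truly spatio-temporally compressed latent space, rather than by sampling video frames at equal intervals. With CV-VAE, existing video models can generate four times more frames with minimal finetuning. Extensive experiments were conducted to demonstrate the effectiveness of the proposed video VAE.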
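
The abstract says only that the regularization loss is formulated "using the image VAE" and does not give its form. As a rough illustration of what latent space compatibility could mean, the toy sketch below encodes a video with a stand-in video encoder, decodes the result with a frozen stand-in image decoder, and compares it with sampled frames. Everything here is an assumption made for illustration and not the authors' implementation: the linear maps, the names `image_decode`, `video_encode`, and `latent_regularization_loss`, the temporal factor of 4 (suggested by "four times more frames"), and the choice of one reference frame per latent.

```python
import numpy as np

rng = np.random.default_rng(0)

T, H, W, C = 8, 4, 4, 3   # toy video: 8 RGB frames of 4x4 pixels
LATENT_DIM = 6            # hypothetical latent size per latent frame
TEMPORAL_FACTOR = 4       # assumed temporal compression ratio

# Frozen stand-in for the image VAE decoder: a fixed linear map latent -> frame.
W_IMG_DEC = rng.normal(size=(LATENT_DIM, H * W * C))


def image_decode(z):
    """Decode latents of shape (n, LATENT_DIM) into frames of shape (n, H, W, C)."""
    return (z @ W_IMG_DEC).reshape(-1, H, W, C)


def video_encode(video, w_enc):
    """Toy video encoder: average each group of TEMPORAL_FACTOR frames, then project."""
    grouped = video.reshape(-1, TEMPORAL_FACTOR, H * W * C).mean(axis=1)
    return grouped @ w_enc


def latent_regularization_loss(video, w_enc):
    """MSE between image-decoded video latents and one sampled frame per latent."""
    z = video_encode(video, w_enc)       # (T // TEMPORAL_FACTOR, LATENT_DIM)
    recon = image_decode(z)              # decoded by the frozen image decoder
    target = video[::TEMPORAL_FACTOR]    # reference frame per latent (assumption)
    return float(np.mean((recon - target) ** 2))


# Build a video whose frames lie exactly in the image decoder's output space.
z0 = rng.normal(size=(T // TEMPORAL_FACTOR, LATENT_DIM))
video = np.repeat(image_decode(z0), TEMPORAL_FACTOR, axis=0)

# An encoder that inverts the image decoder shares its latent space;
# a random encoder does not.
w_compatible = np.linalg.pinv(W_IMG_DEC)
w_random = rng.normal(size=(H * W * C, LATENT_DIM))

loss_compatible = latent_regularization_loss(video, w_compatible)
loss_random = latent_regularization_loss(video, w_random)
print(f"compatible encoder loss: {loss_compatible:.3e}")
print(f"random encoder loss:     {loss_random:.3e}")
```

In this toy setting the loss is near zero only when the video encoder's latents are ones the frozen image decoder can interpret, which is the intuition behind training a video VAE against a fixed image VAE. In a real training loop such a term would be added to the usual reconstruction objective.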
