Fugu-MT Paper Translation (Abstract): Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

Paper overview: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

arxiv url: http://arxiv.org/abs/2304.08818v2
Date: Thu, 28 Dec 2023 03:31:59 GMT
Status: Translation complete
Last updated in system: 2023-12-29 23:13:48.866982
Title: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
Title (reference translation): High-Resolution Video Synthesis with Latent Diffusion Models
Authors: Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis
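Abstract summary: Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands. This paper applies the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. It focuses on two real-world applications: simulation of in-the-wild driving data and creative content creation with text-to-video modeling.
Reference score (attention score computed independently by the site): 71.11425812806431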
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super resolution models. We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, we validate our Video LDM on real driving videos of resolution 512 x 1024, achieving state-of-the-art performance. Furthermore, our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation. Project page: https://research.nvidia.com/labs/toronto-ai/VideoLDM/
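Abstract (reference translation): Latent diffusion models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed, lower-dimensional latent space. This paper applies the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. An LDM is first pre-trained on images only; the image generator is then turned into a video generator by introducing a temporal dimension into the latent-space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, diffusion model upsamplers are temporally aligned, turning them into temporally consistent video super-resolution models. The work focuses on two real-world applications: simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, the Video LDM is validated on real driving videos of resolution 512 x 1024, achieving state-of-the-art performance. Furthermore, the approach can easily leverage off-the-shelf pre-trained image LDMs, since only a temporal alignment model needs to be trained in that case. Doing so turns the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. The temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Using this property, the authors show the first results for personalized text-to-video generation, opening exciting directions for future content creation. Project page: https://research.nvidia.com/labs/toronto-ai/VideoLDM/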
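
The abstract describes keeping a pre-trained image model and training only a temporal alignment component on top of it. A minimal toy sketch of that idea follows, in plain Python with no real network. All names here (`spatial_layer`, `temporal_layer`, `video_block`, `alpha`) are hypothetical, and the blending factor `alpha` is an illustrative assumption that the abstract does not state. It is not the authors' implementation.

```python
from typing import List

Frame = List[float]   # one encoded frame as a flat latent vector (hypothetical layout)
Video = List[Frame]   # a sequence of T encoded frames


def spatial_layer(frame: Frame) -> Frame:
    """Stand-in for a frozen, image-pretrained layer: sees one frame at a time."""
    return [2.0 * x for x in frame]


def temporal_layer(video: Video) -> Video:
    """Stand-in for a trainable temporal layer: averages each latent position over time."""
    num_frames = len(video)
    means = [
        sum(frame[i] for frame in video) / num_frames
        for i in range(len(video[0]))
    ]
    return [list(means) for _ in video]


def video_block(video: Video, alpha: float) -> Video:
    """Apply the spatial layer per frame, then blend in the temporal layer.

    alpha = 1.0 keeps only the per-frame (image model) output;
    alpha < 1.0 mixes in information shared across frames.
    """
    per_frame = [spatial_layer(f) for f in video]
    mixed = temporal_layer(per_frame)
    return [
        [alpha * s + (1.0 - alpha) * t for s, t in zip(s_frame, t_frame)]
        for s_frame, t_frame in zip(per_frame, mixed)
    ]
```

With `alpha = 1.0` the block reduces to the image model applied frame by frame, which illustrates why, under this assumption, an off-the-shelf image LDM could be reused unchanged while only the temporal part is trained.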


Related papers