Fugu-MT Paper Translation (Abstract): Large Motion Video Autoencoding with Cross-modal Video VAE

Paper overview: Large Motion Video Autoencoding with Cross-modal Video VAE

arxiv url: http://arxiv.org/abs/2412.17805v1
Date: Mon, 23 Dec 2024 18:58:24 GMT
Status: Translation complete
Last updated in system: 2024-12-24 19:42:48.554008
Title: Large Motion Video Autoencoding with Cross-modal Video VAE
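Title (reference translation): Large-motion video autoencoding using a cross-modal video VAE
Authors: Yazhou Xing, Yang Fei, Yingqing He, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen
Abstract summary: A video variational autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Existing video VAEs have begun to address temporal compression, but they often suffer from inadequate reconstruction performance. This paper presents a novel and powerful video autoencoder capable of high-fidelity video encoding.
Reference score (independently computed attention measure): 52.13379965800485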
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model for further temporal compression. Second, we propose to leverage the textual information inherent in text-to-video datasets and incorporate text guidance into our model. This significantly enhances reconstruction quality, particularly in terms of detail preservation and temporal stability. Third, we further improve the versatility of our model through joint training on both images and videos, which not only enhances reconstruction quality but also enables the model to perform both image and video autoencoding. Extensive evaluations against strong recent baselines demonstrate the superior performance of our method. The project website can be found at https://yzxing87.github.io/vae/.
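Abstract (reference translation): Learning a robust video variational autoencoder (VAE) is essential for reducing video redundancy and enabling efficient video generation. Applying an image VAE directly to individual frames causes temporal inconsistencies and suboptimal compression rates, because no temporal compression takes place. Existing video VAEs have begun to address temporal compression, but their reconstruction performance is often inadequate. This paper proposes a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that simply extending an image VAE to a 3D VAE entangles spatial and temporal compression, which can introduce motion blur and detail distortion artifacts. We therefore propose temporal-aware spatial compression to better encode and decode spatial information, and we integrate a lightweight motion compression model for further temporal compression. Second, we propose to exploit the textual information inherent in text-to-video datasets and to incorporate text guidance into the model. This significantly improves reconstruction quality, particularly detail preservation and temporal stability. Third, joint training on images and videos further improves the versatility of the model: it raises reconstruction quality and lets the model perform both image and video autoencoding. Extensive evaluations against strong recent baselines show the superior performance of the method. The project website is at https://yzxing87.github.io/vae/.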
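
As a rough illustration of the abstract's point that per-frame image VAEs leave temporal redundancy uncompressed, the sketch below compares latent sizes with and without temporal downsampling. The clip size, the 8x spatial factor, the 4x temporal factor, and the 4 latent channels are all assumptions chosen for the arithmetic, not values reported for this paper; `latent_shape` and `compression_ratio` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentShape:
    frames: int
    height: int
    width: int
    channels: int

    @property
    def numel(self) -> int:
        # Total number of latent values for the clip.
        return self.frames * self.height * self.width * self.channels


def latent_shape(frames, height, width, spatial_factor, temporal_factor, latent_channels):
    """Latent shape after downsampling by the given (assumed) factors."""
    # Ceiling division so a trailing partial group of frames still gets a latent.
    latent_frames = -(-frames // temporal_factor)
    return LatentShape(
        latent_frames,
        height // spatial_factor,
        width // spatial_factor,
        latent_channels,
    )


def compression_ratio(frames, height, width, shape, in_channels=3):
    """Ratio of input values to latent values."""
    return (frames * height * width * in_channels) / shape.numel


# Hypothetical clip: 16 RGB frames at 256x256.
T, H, W = 16, 256, 256

# Image VAE applied frame by frame: spatial compression only (assumed 8x).
per_frame = latent_shape(T, H, W, spatial_factor=8, temporal_factor=1, latent_channels=4)

# Video VAE: same spatial factor plus an assumed 4x temporal compression.
video = latent_shape(T, H, W, spatial_factor=8, temporal_factor=4, latent_channels=4)

print(compression_ratio(T, H, W, per_frame))  # 48.0
print(compression_ratio(T, H, W, video))      # 192.0
```

Under these assumed factors the video latent is four times smaller than the stack of per-frame latents, which is the gap in compression rate that temporal compression is meant to close.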

Related papers