Fugu-MT Paper Translation (Abstract): OneVAE: Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better

Paper summary: OneVAE: Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better

arxiv url: http://arxiv.org/abs/2508.09857v1
Date: Wed, 13 Aug 2025 14:49:54 GMT
Status: Translation complete
Last updated in system: 2025-08-14 20:42:00.935051
Title: OneVAE: Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better
Authors: Yupeng Zhou, Zhen Li, Ziheng Ouyang, Yuming Chen, Ruoyi Du, Daquan Zhou, Bin Fu, Yihao Liu, Peng Gao, Ming-Ming Cheng, Qibin Hou
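Abstract summary: The authors show that FSQ preserves pre-trained continuous VAE priors more effectively than other quantization methods. They propose a multi-token quantization mechanism that improves PSNR by nearly 1 dB without compromising the token compression ratio. They also propose a joint discrete-continuous optimization scheme that unifies the two paradigms.
Reference score (independently computed attention measure): 75.24657690640525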
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Encoding videos into discrete tokens could align with text tokens to facilitate concise and unified multi-modal LLMs, yet introducing significant spatiotemporal compression compared to continuous video representation. Previous discrete video VAEs experienced unstable training, long training time, and degraded reconstruction quality. Given the easier training and superior performance of continuous VAEs, an intuitive idea is to enhance discrete video VAEs by leveraging continuous VAEs. After rethinking the intrinsic link between discrete and continuous representations, we found that FSQ could effectively preserve pre-trained continuous VAE priors compared to other quantization methods. By leveraging continuous VAE priors, it converges several times faster than training from scratch and achieves superior performance at convergence. Meanwhile, two structural improvements are proposed. First, inspired by how continuous VAEs enhance reconstruction via enlarged latent dimensions, we introduce a multi-token quantization mechanism, which achieves nearly a 1 dB improvement in PSNR without compromising the token compression ratio. Second, to tackle reconstruction challenges in high-compression video VAEs, we strengthen first-frame reconstruction, enabling the causal VAE to leverage this information in subsequent frames and markedly improving the performance of 4 x 16 x 16 discrete VAEs. Furthermore, we propose a joint discrete-continuous optimization scheme that unifies the two paradigms and, for the first time, achieves competitive performance on both continuous and discrete representations within a single network. We name our method OneVAE to reflect this connection.
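The abstract names FSQ (finite scalar quantization) as the quantizer that best preserves a pre-trained continuous VAE prior, and reports its gain in PSNR. As a rough illustration only, the NumPy sketch below shows the general FSQ idea: each latent channel is squashed into a bounded range and rounded to one of a few integer levels, and the per-channel codes are combined into a single token id. This is not the OneVAE implementation. The function names, the `levels` argument, and the restriction to odd level counts are assumptions made for this example, and the gradient handling used during training (typically a straight-through estimator) is not shown. A small PSNR helper is included to make the "nearly 1 dB" figure concrete.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Round each latent channel to one of `levels[i]` integer codes.

    Hypothetical helper: only odd level counts are handled in this sketch.
    """
    levels = np.asarray(levels)
    if np.any(levels % 2 == 0):
        raise ValueError("this sketch only handles odd level counts")
    half = (levels - 1) / 2.0        # 5 levels -> codes in {-2, ..., 2}
    bounded = np.tanh(z) * half      # squash each channel into [-half, half]
    return np.round(bounded)         # snap to the nearest integer code

def fsq_token_index(codes, levels):
    """Combine per-channel codes into one token id (mixed-radix encoding)."""
    levels = np.asarray(levels)
    half = (levels - 1) // 2
    digits = (codes + half).astype(np.int64)               # shift to {0, ..., L-1}
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))  # place values
    return (digits * basis).sum(axis=-1)

def psnr(reference, reconstruction, max_val=1.0):
    """Peak signal-to-noise ratio in dB for signals scaled to [0, max_val]."""
    mse = np.mean((reference - reconstruction) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

With `levels = [5, 5, 5]` the implicit codebook has 5 × 5 × 5 = 125 entries, so no learned codebook is needed. Because PSNR is logarithmic in the mean squared error, a 1 dB gain corresponds to reducing the MSE by a factor of about 0.79.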

Related papers