OneVAE: Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better
- URL: http://arxiv.org/abs/2508.09857v1
- Date: Wed, 13 Aug 2025 14:49:54 GMT
- Title: OneVAE: Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better
- Authors: Yupeng Zhou, Zhen Li, Ziheng Ouyang, Yuming Chen, Ruoyi Du, Daquan Zhou, Bin Fu, Yihao Liu, Peng Gao, Ming-Ming Cheng, Qibin Hou,
- Abstract summary: We show that FSQ effectively preserves pre-trained continuous VAE priors compared to other quantization methods. We introduce a multi-token quantization mechanism, which achieves nearly a 1 dB improvement in PSNR without compromising the token compression ratio. We propose a joint discrete-continuous optimization scheme that unifies the two paradigms.
- Score: 75.24657690640525
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Encoding videos into discrete tokens could align with text tokens to facilitate concise and unified multi-modal LLMs, yet it introduces significant spatiotemporal compression compared to continuous video representations. Previous discrete video VAEs suffered from unstable training, long training times, and degraded reconstruction quality. Given the easier training and superior performance of continuous VAEs, an intuitive idea is to enhance discrete video VAEs by leveraging continuous VAEs. After rethinking the intrinsic link between discrete and continuous representations, we found that FSQ preserves pre-trained continuous VAE priors more effectively than other quantization methods. By leveraging continuous VAE priors, it converges several times faster than training from scratch and achieves superior performance at convergence. Meanwhile, two structural improvements are proposed. First, inspired by how continuous VAEs enhance reconstruction via enlarged latent dimensions, we introduce a multi-token quantization mechanism, which achieves nearly a 1 dB improvement in PSNR without compromising the token compression ratio. Second, to tackle reconstruction challenges in high-compression video VAEs, we strengthen first-frame reconstruction, enabling the causal VAE to leverage this information in subsequent frames and markedly improving the performance of 4×16×16 discrete VAEs. Furthermore, we propose a joint discrete-continuous optimization scheme that unifies the two paradigms and, for the first time, achieves competitive performance on both continuous and discrete representations within a single network. We name our method OneVAE to reflect this connection.
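The mechanisms named in the abstract can be made concrete in a few lines. Below is a minimal PyTorch sketch, not OneVAE's implementation: the level counts, the channel grouping, and the `joint_loss` weighting are illustrative assumptions.

```python
# Minimal sketch (PyTorch) of FSQ-style quantization over a continuous VAE
# latent. Level counts, token grouping, and loss weights are illustrative
# assumptions, not OneVAE's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FSQ(nn.Module):
    """Finite Scalar Quantization: bound each channel, then round to integers.

    There is no learned codebook, so a latent from a pre-trained continuous
    VAE passes through with only bounded rounding applied, which is one
    reading of how FSQ can preserve continuous priors better than VQ-style
    methods. Odd level counts keep the sketch simple (even levels need a
    half-shift in full FSQ).
    """
    def __init__(self, levels=(7, 7, 7, 5, 5)):
        super().__init__()
        levels_t = torch.tensor(levels, dtype=torch.float32)
        self.register_buffer("half", (levels_t - 1) / 2)

    def forward(self, z):                      # z: (..., len(levels))
        z_b = torch.tanh(z) * self.half        # bound each channel to (-half, half)
        z_q = torch.round(z_b)                 # snap to the nearest integer level
        # Straight-through estimator: quantized forward pass, identity gradient.
        return z_b + (z_q - z_b).detach()

def multi_token_quantize(z, fsq, num_tokens=2):
    # Multi-token quantization (sketch): split the channel dimension into
    # groups and emit one discrete token per group, enlarging representational
    # capacity without changing the spatiotemporal compression ratio.
    # z: (..., num_tokens * len(levels))
    return torch.cat([fsq(g) for g in z.chunk(num_tokens, dim=-1)], dim=-1)

def joint_loss(decoder, z, fsq, x, w=0.5):
    # Joint discrete-continuous optimization (sketch): one decoder is trained
    # to reconstruct the video x from both the continuous latent and its
    # quantized counterpart, so a single network serves both paradigms.
    rec_cont = decoder(z)
    rec_disc = decoder(fsq(z))
    return w * F.mse_loss(rec_cont, x) + (1 - w) * F.mse_loss(rec_disc, x)
```

Since FSQ's integer outputs enumerate a fixed grid of prod(levels) combinations per token, each quantized group maps to an implicit codebook index, which is what lets video tokens sit in a discrete vocabulary alongside text tokens.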
Related papers
- GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining [64.72014392166625]
GMS-CAVP is a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures semantic and temporal relations across varying granularities. Second, we go beyond traditional contrastive learning by incorporating a diffusion-based generative objective, enabling modality translation and synthesis between video and audio.
arXiv Detail & Related papers (2026-01-27T13:43:32Z)
- SFTok: Bridging the Performance Gap in Discrete Tokenizers [72.9996757048065]
We propose SFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet.
arXiv Detail & Related papers (2025-12-18T18:59:04Z)
- Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance [24.88807532823577]
We propose S2VC, a Single-Step diffusion-based Video Codec that integrates a conditional coding framework with an efficient single-step diffusion generator. We show that S2VC delivers state-of-the-art perceptual quality with an average 52.73% saving over prior perceptual methods.
arXiv Detail & Related papers (2025-12-08T12:05:30Z)
- VAEVQ: Enhancing Discrete Visual Tokenization through Variational Modeling [22.005420177236804]
We propose VAEVQ, which comprises three key components: (1) Variational Latent Quantization (VLQ), replacing the AE with a VAE for quantization to leverage its structured and smooth latent space, thereby facilitating more effective codeword activation; (2) Representation Coherence Strategy (RCS), adaptively modulating the alignment strength between pre- and post-quantization features to enhance consistency and prevent overfitting to noise; and (3) Distribution Consistency Regularization (DCR), aligning the entire codebook distribution with the continuous latent distribution to improve utilization (see the generic sketch after this list).
arXiv Detail & Related papers (2025-11-10T09:07:23Z)
- OmniSAT: Compact Action Token, Faster Auto Regression [70.70037017501357]
We introduce an Omni Swift Action Tokenizer, which learns a compact, transferable action representation. The resulting discrete tokenization shortens the training sequence by 6.8× and lowers the target entropy.
arXiv Detail & Related papers (2025-10-08T03:55:24Z)
- DiffVC-OSD: One-Step Diffusion-based Perceptual Neural Video Compression Framework [45.134271969594614]
We first propose DiffVC-OSD, a One-Step Diffusion-based Perceptual Neural Video Compression framework. We employ an End-to-End Finetuning strategy to improve overall compression performance.
arXiv Detail & Related papers (2025-08-11T06:59:23Z)
- CoVAE: Consistency Training of Variational Autoencoders [9.358185536754537]
We propose a novel single-stage generative autoencoding framework that adopts techniques from consistency models to train a VAE architecture. We show that CoVAE can generate high-quality samples in one or a few steps without the use of a learned prior. Our approach provides a unified framework for autoencoding and diffusion-style generative modeling, and a viable route to high-performance one-step generative autoencoding.
arXiv Detail & Related papers (2025-07-12T01:32:08Z)
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [70.4360995984905]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs.
arXiv Detail & Related papers (2025-06-09T17:59:55Z)
- "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z)
- Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
The new tokenizer, a Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss. Even a scaled-down version of CDT (3× inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z)
- ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack [71.2286719703198]
We propose the Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack (ReToMe-VA).
To achieve spatial imperceptibility, ReToMe-VA adopts a Timestep-wise Adversarial Latent Optimization (TALO) strategy.
To achieve temporal imperceptibility, ReToMe-VA introduces a Recursive Token Merging (ReToMe) mechanism by matching and merging tokens across video frames.
arXiv Detail & Related papers (2024-08-10T08:10:30Z)
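On the VAEVQ entry above: its Distribution Consistency Regularization aligns the codebook distribution with the continuous latent distribution. The following is a generic, purely illustrative sketch of one way such an alignment term could look (simple moment matching); it is an assumption for exposition, not VAEVQ's actual regularizer.

```python
# Illustrative sketch of a distribution-consistency idea in the spirit of
# VAEVQ's DCR: nudge the learned codebook's per-dimension statistics toward
# those of the continuous latents. Not the paper's actual formulation.
import torch

def distribution_consistency(codebook, latents):
    # codebook: (K, d) learned codewords; latents: (N, d) continuous encodings
    mu_c, mu_z = codebook.mean(dim=0), latents.mean(dim=0)
    var_c, var_z = codebook.var(dim=0), latents.var(dim=0)
    # Penalize mismatch in first and second moments along each latent dimension.
    return (mu_c - mu_z).pow(2).mean() + (var_c - var_z).pow(2).mean()
```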