Rethinking Video Tokenization: A Conditioned Diffusion-based Approach
- URL: http://arxiv.org/abs/2503.03708v3
- Date: Thu, 27 Mar 2025 11:46:22 GMT
- Title: Rethinking Video Tokenization: A Conditioned Diffusion-based Approach
- Authors: Nianzu Yang, Pandeng Li, Liming Zhao, Yang Li, Chen-Wei Xie, Yehui Tang, Xudong Lu, Zhihang Liu, Yun Zheng, Yu Liu, Junchi Yan
- Abstract summary: A new tokenizer, the Conditioned Diffusion-based video Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss. Even a scaled-down version of CDT (3$\times$ inference speedup) still performs comparably with top baselines.
- Score: 58.164354605550194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing video tokenizers typically use the traditional Variational Autoencoder (VAE) architecture for video compression and reconstruction. However, to achieve good performance, their training often relies on complex multi-stage tricks that go beyond basic reconstruction loss and KL regularization. Among these tricks, the most challenging is the precise tuning of adversarial training with additional Generative Adversarial Networks (GANs) in the final stage, which can hinder stable convergence. In contrast to GANs, diffusion models offer more stable training and can generate higher-quality results. Inspired by these advantages, we propose CDT, a novel Conditioned Diffusion-based video Tokenizer that replaces the GAN-based decoder with a conditional causal diffusion model. The encoder compresses spatio-temporal information into compact latents, while the decoder reconstructs videos through a reverse diffusion process conditioned on these latents. During inference, we incorporate a feature cache mechanism to generate videos of arbitrary length while maintaining temporal continuity, and adopt a sampling acceleration technique to enhance efficiency. Trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss, CDT achieves state-of-the-art performance in video reconstruction with just a single sampling step, as demonstrated by extensive experiments. Even a scaled-down version of CDT (3$\times$ inference speedup) still performs comparably with top baselines. Moreover, the latent video generation model trained with CDT also exhibits superior performance. The source code and pretrained weights are available at https://github.com/ali-vilab/CDT.
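To make the training recipe concrete, below is a minimal PyTorch-style sketch of one CDT-like training step combining the MSE diffusion loss, KL term, and LPIPS perceptual loss described in the abstract. The `encoder`, `denoiser`, and `lpips_fn` arguments, the cosine noise schedule, and the (B, T, C, H, W) tensor layout are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def cdt_style_training_step(encoder, denoiser, lpips_fn, video,
                            kl_weight=1e-6, lpips_weight=0.1):
    """One illustrative training step: MSE diffusion loss + KL term + LPIPS.

    video: (B, T, C, H, W) in [-1, 1]; encoder/denoiser/lpips_fn are placeholders.
    """
    # Encoder compresses spatio-temporal information into a compact latent posterior.
    mu, logvar = encoder(video)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    # KL regularization toward a standard normal prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    # Conditional diffusion reconstruction: noise the clean video and let the
    # denoiser, conditioned on z, predict the injected noise (plain MSE).
    t = torch.rand(video.shape[0], device=video.device)
    alpha = torch.cos(t * torch.pi / 2).view(-1, 1, 1, 1, 1)  # toy cosine schedule
    sigma = torch.sin(t * torch.pi / 2).view(-1, 1, 1, 1, 1)
    noise = torch.randn_like(video)
    noisy = alpha * video + sigma * noise
    pred_noise = denoiser(noisy, t, cond=z)
    diffusion_mse = F.mse_loss(pred_noise, noise)

    # LPIPS perceptual loss on a one-step estimate of the clean video (frame-wise).
    x0_hat = (noisy - sigma * pred_noise) / alpha.clamp(min=1e-4)
    perceptual = lpips_fn(x0_hat.flatten(0, 1), video.flatten(0, 1)).mean()

    return diffusion_mse + kl_weight * kl + lpips_weight * perceptual
```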
Related papers
- REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder [52.698595889988766]
We present a novel perspective on learning video embedders for generative modeling.
Rather than requiring an exact reproduction of an input video, an effective embedder should focus on visually plausible reconstructions.
We propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework.
arXiv Detail & Related papers (2025-03-11T17:51:07Z) - TIDE : Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation [34.73820805875123]
TIDE (Temporal-aware Sparse Autoencoders for Interpretable Diffusion transformErs) is a novel framework that enhances temporal reconstruction within DiT activation layers across denoising steps.
TIDE employs Sparse Autoencoders (SAEs) with a sparse bottleneck layer to extract interpretable and hierarchical features.
Our approach achieves state-of-the-art reconstruction performance, with a mean squared error (MSE) of 1e-3 and a cosine similarity of 0.97.
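For readers unfamiliar with sparse autoencoders, the sketch below shows a generic SAE with a sparse bottleneck trained by reconstruction MSE plus an L1 penalty; the dimensions and penalty weight are illustrative assumptions, not TIDE's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Generic SAE: an overcomplete dictionary with a sparse, non-negative bottleneck."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        codes = F.relu(self.enc(activations))   # sparse, interpretable features
        return self.dec(codes), codes

def sae_loss(recon, codes, activations, l1_coef=1e-3):
    # Reconstruction MSE plus an L1 sparsity penalty on the bottleneck codes.
    return F.mse_loss(recon, activations) + l1_coef * codes.abs().mean()
```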
arXiv Detail & Related papers (2025-03-10T08:35:51Z) - Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation [54.21476271127356]
Divot is a Diffusion-Powered Video Tokenizer. We present Divot-Vicuna for video-to-text autoregression and text-to-video generation.
arXiv Detail & Related papers (2024-12-05T18:53:04Z) - High-Efficiency Neural Video Compression via Hierarchical Predictive Learning [27.41398149573729]
Enhanced Deep Hierarchical Video Compression (DHVC 2.0) introduces superior compression performance and impressive complexity efficiency.
It uses hierarchical predictive coding to transform each video frame into multiscale representations.
It also supports transmission-friendly progressive decoding, making it particularly advantageous for networked video applications in the presence of packet loss.
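As a rough illustration of the hierarchical predictive-coding idea, the toy sketch below builds a spatial multiscale pyramid per frame and keeps only the residual between each scale and a prediction upsampled from the coarser scale. The real DHVC 2.0 system uses learned components; this non-learned, spatial-only sketch is purely conceptual.

```python
import torch
import torch.nn.functional as F

def multiscale_residuals(frame: torch.Tensor, num_scales: int = 3):
    """Toy predictive coding: code the coarsest scale, then only prediction residuals.

    frame: (N, C, H, W) with H and W divisible by 2**(num_scales - 1).
    """
    pyramid = [frame]
    for _ in range(num_scales - 1):
        pyramid.append(F.avg_pool2d(pyramid[-1], kernel_size=2))   # coarser scales
    residuals = [pyramid[-1]]                                      # coarsest coded directly
    for fine, coarse in zip(reversed(pyramid[:-1]), reversed(pyramid[1:])):
        prediction = F.interpolate(coarse, scale_factor=2, mode="bilinear",
                                   align_corners=False)
        residuals.append(fine - prediction)                        # only the error is sent
    return residuals
```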
arXiv Detail & Related papers (2024-10-03T15:40:58Z) - Sample what you can't compress [6.24979299238534]
We show how to learn a continuous encoder and decoder under a diffusion-based loss.
This approach yields better reconstruction quality as compared to GAN-based autoencoders.
We also show that the resulting representation is easier to model with a latent diffusion model as compared to the representation obtained from a state-of-the-art GAN-based loss.
arXiv Detail & Related papers (2024-09-04T08:42:42Z) - Uncertainty-Aware Deep Video Compression with Ensembles [24.245365441718654]
We propose an uncertainty-aware video compression model that can effectively capture predictive uncertainty with deep ensembles.
Our model can effectively save bits by more than 20% compared to 1080p sequences.
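The deep-ensemble idea can be illustrated in a few lines: run several independently trained predictors and read their disagreement as uncertainty. The model list and tensor shapes below are hypothetical; this is not the paper's actual compression pipeline.

```python
import torch

@torch.no_grad()
def ensemble_predict(models, reference_frames):
    """Predict the next frame with a deep ensemble and estimate per-pixel uncertainty."""
    preds = torch.stack([m(reference_frames) for m in models])  # (N_models, B, C, H, W)
    mean = preds.mean(dim=0)         # ensemble prediction
    uncertainty = preds.var(dim=0)   # high variance -> less trustworthy prediction
    return mean, uncertainty
```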
arXiv Detail & Related papers (2024-03-28T05:44:48Z) - Boosting Neural Representations for Videos with a Conditional Decoder [28.073607937396552]
Implicit neural representations (INRs) have emerged as a promising approach for video storage and processing.
This paper introduces a universal boosting framework for current implicit video representation approaches.
arXiv Detail & Related papers (2024-02-28T08:32:19Z) - Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference [95.42299246592756]
We study the UNet encoder and empirically analyze the encoder features.
We find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps.
We validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation.
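That observation suggests caching the encoder features and rerunning only the decoder at most steps. The schematic sketch below assumes a UNet split into `unet_enc`/`unet_dec` and a generic `scheduler_step`, all of which are placeholders rather than the paper's actual API.

```python
import torch

@torch.no_grad()
def sample_with_encoder_reuse(unet_enc, unet_dec, scheduler_step, x_t, timesteps,
                              refresh_every=2):
    enc_feats = None
    for i, t in enumerate(timesteps):
        if enc_feats is None or i % refresh_every == 0:
            enc_feats = unet_enc(x_t, t)       # full (expensive) encoder pass
        eps = unet_dec(x_t, t, enc_feats)      # decoder runs every step with cached skips
        x_t = scheduler_step(x_t, eps, t)      # standard DDIM/DDPM update
    return x_t
```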
arXiv Detail & Related papers (2023-12-15T08:46:43Z) - Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z) - Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation [92.55296042611886]
We propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following the frames already generated by an LDM.
We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets.
arXiv Detail & Related papers (2023-09-07T08:12:58Z) - VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
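At a high level the flow reads as: synthesize a reference image from the prompt, then generate the video conditioned on both. The function names below are hypothetical placeholders, not the VideoGen API.

```python
def text_to_video(prompt: str, t2i_model, ref_guided_video_model, num_frames: int = 16):
    # Step 1: high-content-quality reference image from an off-the-shelf T2I model.
    reference_image = t2i_model(prompt)
    # Step 2: video generation conditioned on both the prompt and the reference image.
    return ref_guided_video_model(prompt=prompt, reference=reference_image,
                                  num_frames=num_frames)
```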
arXiv Detail & Related papers (2023-09-01T11:14:43Z) - Transform-Equivariant Consistency Learning for Temporal Sentence
Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
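To make the consistency idea concrete, here is a toy equivariance check under a single temporal transform (time reversal): boundaries predicted on the reversed video should mirror the original prediction. The `grounding_model` interface and normalized-boundary convention are illustrative assumptions, not the paper's framework.

```python
import torch
import torch.nn.functional as F

def equivariance_consistency_loss(grounding_model, video, query):
    # video: (B, T, C, H, W); predicted boundaries are normalized to [0, 1]
    start, end = grounding_model(video, query)
    start_r, end_r = grounding_model(torch.flip(video, dims=[1]), query)
    # Under temporal reversal, the boundary (s, e) should map to (1 - e, 1 - s).
    return F.mse_loss(start_r, 1.0 - end) + F.mse_loss(end_r, 1.0 - start)
```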
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Lossy Image Compression with Conditional Diffusion Models [25.158390422252097]
This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models.
In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model.
Our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics.
arXiv Detail & Related papers (2022-09-14T21:53:27Z) - A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs).
The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved.
We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
arXiv Detail & Related papers (2022-02-06T16:29:15Z) - Investigating Tradeoffs in Real-World Video Super-Resolution [90.81396836308085]
Real-world video super-resolution (VSR) models are often trained with diverse degradations to improve generalizability.
To alleviate the first tradeoff, we propose a degradation scheme that reduces training time by up to 40% without sacrificing performance.
To facilitate fair comparisons, we propose the new VideoLQ dataset, which contains a large variety of real-world low-quality video sequences.
arXiv Detail & Related papers (2021-11-24T18:58:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.